ABSTRACT

desirability, feasibility, and potential impact of two reporting practices 
for National Assessment of Educational Progress (NAEP) results: 
district-level reporting and market-basket reporting. NAEP’s sponsors believe
that reporting district-level NAEP results would support state and local 
education reform efforts. The proposal for a market-basket approach to NAEP
reporting is based on the belief that the large-scale release of a 
market-basket set of items could demystify the assessment by providing many 
examples of the content and skills assessed and the format of items. Using 
percent correct scores to summarize performance on the market basket would be 
an attempt to make test results more user friendly. Data from a variety of 
sources, including workshops attended by representatives of the National 
Assessment Governing Board, the National Center for Education Statistics, and 
other interested parties, were used to examine these issues and those posed 
by the use of a NAEP short form. Market research emphasizing both needs 
analysis and product analysis is necessary to evaluate the level of interest 
in district-level reporting, and the decision to move ahead should be based 
on real interest. Any decisions about the configuration of the NAEP market 
basket will involve tradeoffs, and some methods would result in simpler 
procedures than others without supporting the desired inferences. If the 
decision is made to design a market basket for the NAEP, its configurations 
should be based on a clear articulation of the purposes and objectives of the 
market basket. These new reporting practices would provide information that 
would receive attention from new audiences for NAEP results. The potential of 
these reporting methods for significant impact on curriculum and assessment 
at local levels is high. If either is implemented, program sponsors should 
develop intensive support systems. One appendix discusses the background and 
current uses of the consumer price index, and the other illustrates a 

Executive Summary



Since 1969, the National Assessment of Educational Progress (NAEP) 
has been assessing educational attainment across the country. Mandated 
by Congress, NAEP surveys the educational accomplishments of students 
in the United States, monitors changes in achievement, and provides a mea- 
sure of student learning at critical points in their school experience. NAEP 
results are summarized for the nation as a whole and for individual states 
with sufficient numbers of participating schools and students. 

NAEP’s sponsors believe that NAEP could provide useful data about
educational achievement below the state level. They suggest that below 
state results “could provide an important source of data for informing a 
variety of education reform efforts at the local level” (National Assessment 
Governing Board, 1995b). In addition, district-level reporting could pro- 
vide local educators with feedback in return for their participation in NAEP, 
something that NAEP’s sponsors believe might increase motivation to
participate in the assessment. Reporting results below the state level was 
prohibited until 1994. The Improving America’s Schools Act of 1994,
which reauthorized NAEP in that year, removed the language prohibiting 
below-state reporting and set the stage for consideration of reporting 
district-level and school-level results. 

At the same time, NAEP’s sponsors have been taking a critical look at
their reporting procedures with an eye toward improving the usefulness 
and interpretability of reports. An overarching principle in their recent 
redesign policy is to define the audience for NAEP reports and to vary the 
kind and amount of detail in reports to make them most useful for the
various audiences. Accordingly, NAEP’s sponsors have funded studies to
examine the ways in which reports are used by policy makers, educators, 
the press, and others and to identify common misuses and misinterpreta- 
tions of reported data. 

Within the context of the redesign proposals, the idea of market-bas- 
ket reporting emerged as a way to better communicate what students in the 
United States know and are able to do at grade levels tested by NAEP. The 
market-basket concept is based on the idea that a relatively limited set of 
items can represent some larger construct. NAEP’s sponsors draw parallels
between the proposed NAEP market basket and the Consumer Price Index
(CPI). The proposed NAEP market basket would consist of a publicly
released collection of items intended to represent the content and skills
assessed. Percent correct scores, a metric NAEP’s sponsors believe is widely
understood, would be used to summarize performance on the collection of
items.



STUDY APPROACH 

At the request of the Department of Education, the National Research 
Council formed the Committee on NAEP Reporting Practices to address 
questions about the desirability, feasibility, and potential impact of imple- 
menting these reporting practices. The committee developed study ques- 
tions designed to address issues surrounding district-level and market-bas- 
ket reporting. Study questions focused on the: 

• characteristics and features of the reporting methods, 

• information needs likely to be served, 

• level of interest in the reporting practices, 

• types of inferences that could be based on the reported data, 

• implications of the reporting methods for NAEP, and 

• implications of the reporting methods for state and local educa- 
tion programs. 

To gather information on these issues, the committee reviewed the 
literature and policy statements on these two reporting practices; invited 
representatives from the National Assessment Governing Board (NAGB) 
and the National Center for Education Statistics (NCES) to attend their 
meetings and present information; attended NAGB board and subcommittee
meetings; held a discussion during the Large Scale Assessment
Conference sponsored by the Council of Chief State School Officers 
(CCSSO); arranged for a briefing on the CPI; and convened two multiday 
workshops. One workshop focused on district-level reporting, the other 
addressed market-basket reporting. 

DISTRICT-LEVEL REPORTING 

NAEP’s sponsors believe that reporting district-level NAEP results
would support local and state education reform efforts. Their rationale is 
that reporting NAEP performance for school districts has the potential to 
enable comparisons that cannot be made based on existing assessment re- 
sults: comparisons of district-level achievement results across state bound- 
aries and comparisons of district-level results with national assessment data. 

Opinions about the desirability of such data are varied. Some participants
in the committee’s workshop believed the information would be
uniquely informative. For example, comparisons among districts with simi- 
lar demographic characteristics would allow them to identify those per- 
forming better than expected and instructional practices that work well. 
Others were attracted to the prospect of having a means for external valida- 
tion and considered NAEP to be a stable external measure of achievement 
for making comparisons with their state and local assessment results. 
Another appealing feature to workshop participants was the possibility of 
assessment results in subject areas and grades not tested by their state or 
local programs. In addition, NAEP collects background data that many 
states and districts do not have the resources to collect, and they would 
look forward to receiving reports that associate district-level performance 
with background and school environmental data. 

Other workshop participants were wary of the ways data might be 
used. Officials from some of the larger urban areas maintained that they 
were already aware that their children do not perform as well as those from 
other districts. Another set of assessment results would provide yet another 
opportunity for the press and others to criticize them. Some expressed 
concern about alignment issues, noting that their curricula do not necessar- 
ily match the material tested on NAEP. Attempts to use NAEP as a means 
of external validation for the state assessment would be problematic when 
the state assessment is aligned with instruction and NAEP is not, particu- 
larly if results from the different assessments suggest different findings about 
student achievement. 
Given workshop participants’ comments and the materials reviewed,
the committee’s overriding concern about developing a program for
reporting district-level NAEP results relates to districts’ and states’ levels of 
interest. Previous attempts at reporting district results (in 1996 and 1998) 
indicated virtually no interest in receiving district-level summaries of per- 
formance. Workshop participants’ reactions were mixed, in part due to the 
lack of information about the goals, objectives, specifications, and costs of a 
district-level program. It is not clear what district-level reporting is intended
to accomplish or whom the program would serve — only large urban dis- 
tricts or smaller districts as well. Decisions have not been made about the 
types of information districts and states would receive, when they would 
receive the information, how much it would cost, or who would pay the 
costs. These details need to be resolved before NAEP’s sponsors can expect
to gauge actual interest in receiving district-level results. Once the details 
are specified, then it is important to determine if there is sufficient interest 
to justify pursuing the program. On these points, the committee offers the 
following recommendations: 

RECOMMENDATION: Market research emphasizing both 
needs analysis and product analysis is necessary to evaluate 
the level of interest in district-level reporting. The decision to 
move ahead with district-level reporting should be based on 
the results of market research conducted by an independent 
market-research organization. If market research suggests that 
there is little or no interest in district-level reporting, NAEP’s 
sponsors should not continue to invest NAEP’s limited re- 
sources pursuing district-level reporting. 

RECOMMENDATION: If the decision is made to proceed 
with district-level reporting, NAEP’s sponsors should develop 
and implement a plan for program evaluation, similar to the 
research conducted during the initial years of the Trial State 
Assessment, that would investigate the quality and utility of 
district-level NAEP data. 

MARKET-BASKET REPORTING

Large-Scale Release of Items

The proposal for a NAEP market basket emanates from desires to make 
NAEP results more meaningful and more easily understood. According to 
workshop participants, the large-scale release of a market-basket set of items 
could demystify the assessment by providing many examples of the content 
and skills assessed and the format of items. Review of the content and skill 
coverage could stimulate discussions with local educators about their cur- 
ricula and instructional programs. Review of NAEP item formats could 
lead to improvements in the design of items used with state, local, and 
classroom-based assessments. In addition, public review of the released 
materials could enhance understanding of the goals and purposes of the 
national assessment and might lead to increased public support for testing. 
Although workshop participants were generally positive about a large-scale 
release of items, they noted that a large release could be overly influential 
on local and state curricula or assessments. For instance, policy makers and 
educators concerned about their NAEP performance could attempt to align 
their curricula more closely with what is tested on NAEP. Assessment, cur- 
ricula, and instructional practices form a tightly woven system — making 
changes to one aspect of the system can have an impact on other aspects of 
the system. 



Percent Correct Scores 

Using percent correct scores to summarize performance on the market 
basket is intended to make test results more user friendly. Because nearly 
everyone who has passed through the American school system has at one 
time or another been tested and received a percent correct score, most 
people could be expected to understand such scores. NAEP’s sponsors 
believe that percent correct scores would have immediate meaning to the 
public. 

Based on workshop participants’ reactions, it is doubtful that percent 
correct scores would be more easily understood than the achievement-level 
results that NAEP currently reports. Many users have become accustomed
to achievement level reporting; moving to a percent correct metric would 
require new interpretative assistance. Further, the use of this metric pre- 
sents a number of challenges. For example, it is unclear whether percent 
correct scores would indicate percent of questions answered correctly or
percent of possible points; both pose complications due to the mix of
multiple-choice items and constructed response tasks scored on multipoint
scales. In addition, NAEP results are not currently reported on a percent
correct metric. 
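
The ambiguity between the two definitions can be illustrated with a small calculation. The sketch below is illustrative only: the item mix, the point values, and the rule that an item counts as "correct" only when full credit is earned are assumptions made for the example, not NAEP scoring rules.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    earned: int    # points the student earned on the item
    possible: int  # maximum points available on the item


def percent_items_correct(results):
    """Percent of items on which the student earned full credit (assumed rule)."""
    full_credit = sum(1 for r in results if r.earned == r.possible)
    return 100.0 * full_credit / len(results)


def percent_points_earned(results):
    """Percent of all possible points that the student earned."""
    total_earned = sum(r.earned for r in results)
    total_possible = sum(r.possible for r in results)
    return 100.0 * total_earned / total_possible


# Hypothetical student: three 1-point multiple-choice items and
# two constructed response tasks scored on a 0-4 scale.
results = [
    ItemResult(1, 1),
    ItemResult(1, 1),
    ItemResult(0, 1),
    ItemResult(2, 4),
    ItemResult(3, 4),
]

items_pct = percent_items_correct(results)   # 2 of 5 items at full credit
points_pct = percent_points_earned(results)  # 7 of 11 possible points
print(f"{items_pct:.1f}% of items, {points_pct:.1f}% of points")
```

For this hypothetical student, the same responses yield 40 percent of items correct but roughly 64 percent of possible points, so a report would need to state which definition it uses.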



The NAEP Short Form 

With the NAEP short form, the release of exemplar items would be 
smaller, but an intact form would be provided to states and districts to 
administer as they see fit. NAEP’s sponsors hope that the short form will
enable faster and more understandable reporting of results. Initial plans 
call for a fourth-grade mathematics short form, but the ultimate plan might 
be to develop short forms for a variety of subjects for states to use in years 
when NAEP assessments are not scheduled for particular subjects. The 
policy guiding development of the short forms stipulates that results be 
reported using NAEP achievement levels. 

Many workshop participants found the idea of the short form to be 
appealing, but their comments reflected a variety of conceptions about the 
characteristics of a short form. Several envisioned the short form as a set of 
items that could be embedded into existing assessments to link results from 
state and local assessments with NAEP, while others viewed the short form 
as a mechanism for providing more timely reporting of NAEP results. Still 
others see it as a means for facilitating district-level or school-level report- 
ing. It is not clear that all of these desired uses would be supported. 

These widely divergent conceptions are exacerbated by the limited 
policy guidance NAGB has provided. While the generality of policy state- 
ments is appropriate so that developers are not limited in the approaches 
they might consider to put policy into practice, the lack of detail makes the 
short form an amorphous concept open to a variety of interpretations. Too 
many details remain undecided for the committee to make specific recom- 
mendations about the short form. 

CONCLUSION: Thus far, the NAEP short form has been
defined by general NAGB policy, but it has not been developed 
in sufficient technical and practical detail to allow potential 
users to react to a firm proposal. Instead, users are projecting 
into the general idea their own desired characteristics for the 
short form, such as an anchor for linking scales. Some of their 
ideas and desires for the short form have already been
determined to be problematic. It will not be possible for a
short-form design to support all uses described by workshop
participants. 



Long Market Baskets versus Short-Form Market Baskets 

All configurations for the market basket will involve tradeoffs. A market
basket composed of a large collection of items is more likely to be
representative of the NAEP frameworks. As currently conceived, the NAEP 
short forms consist of approximately 30 items to be administered during a 
60-minute testing period. A collection this small is unlikely to adequately 
represent the NAEP frameworks. Deriving results from the short form that
are representative of the NAEP frameworks, technically sound, and
comparable across versions of the short form and with main NAEP results poses
significant challenges. On these points, the committee makes the following
recommendation:

RECOMMENDATION: All decisions about the configuration 
of the NAEP market basket will involve tradeoffs. Some meth- 
ods for configuring the market basket would result in simpler 
procedures than others but would not support the desired 
inferences. Other methods would yield more generalizable 
results but at the expense of simplicity. If the decision is made 
to proceed with designing a NAEP market basket, its configu- 
ration should be based on a clear articulation of the purposes 
and objectives for the market basket. 

ANALOGIES WITH THE CONSUMER PRICE INDEX 
MARKET BASKET 

Because analogies have frequently been made between the NAEP 
market basket and the Consumer Price Index (CPI), the committee investi- 
gated the extent to which such comparisons hold. In considering the pro- 
posals to develop and report a summary measure from the existing NAEP 
frameworks, the committee realized that the purpose and construction of
the CPI market basket differ fundamentally from the corresponding
elements of current NAEP proposals. The task of building an educational
parallel to the CPI is formidable and appears to differ conceptually from 
the current NAEP market-basket development activities. Implementing a
true “market-basket” approach in NAEP would thus necessitate major
operational changes. Most importantly, a large national survey would need 
to be conducted to determine what students are actually taught in U.S. 
classrooms. Survey results would be used to construct the market basket, 
and then students would be tested to evaluate performance on the market 
basket. 

Furthermore, the market-basket metaphor may be inappropriate. A 
market basket is an odd, even jarring image in the context of educational 
achievement. Most people do not see education as a consumer purchase or 
an assortment of independent parcels placed in a shopping cart. On these 
points, we find: 

CONCLUSION: Use of the term “market basket” is mislead- 
ing because (1) the NAEP frameworks reflect the aspirations 
of policy makers and educators and are not purely descriptive 
in nature and (2) the current operational features of NAEP 
differ fundamentally from the data collection processes used 
in producing the CPI.

RECOMMENDATION: In describing the various proposals 
for reporting a summary measure from the existing NAEP 
frameworks, NAEP’s sponsors should refrain from using the 
term “market basket” because of inaccuracies in the implied 
analogy with the CPI.

RECOMMENDATION: If, given the issues raised about market- 
basket reporting, NAEP’s sponsors wish to pursue the develop- 
ment of this concept, they should consider developing an edu- 
cational index that possesses characteristics analogous to those 
of the Consumer Price Index: (1) is descriptive rather than 
reflecting policy makers’ and educators’ aspirations; (2) is 
reflective of regional differences in educational programs; and 
(3) is updated regularly to incorporate changes in curriculum. 

ENHANCING REPORTS 

When NAEP began reporting state-level results in 1990, researchers 
and others expressed concerns about potential misinterpretation or misuse 
of the data. Although not all the dire predictions came true, reports of
below-state NAEP results increase the potential for misinterpretation prob- 
lems. Given the amount of attention that below-state results would likely 
receive, whether derived from main NAEP or via a NAEP short form, sig- 
nificant attention should be devoted to product design. 

As part of our study, the committee hoped to be able to review 
prototypic reports for the proposed reporting methods. While some pre- 
liminary examples of district-level and market-basket reports were avail- 
able, NAEP’s sponsors have not made definitive decisions about the format 
of reports. Given the stage of report design, we conducted a review of the 
literature on NAEP reporting procedures and examined examples of NAEP 
reports. Based on these reviews, we offer suggestions and recommendations 
for report design. 

The design of data displays should be carefully evaluated and should 
evolve through methodical processes that consider the purposes of the data, 
the needs of users, the types of interpretations, and the anticipated types of 
misinterpretations. User-needs analysis is an appropriate forum for deter- 
mining both product design and effective metaphors for aiding in commu- 
nication. 

Even if the proposals for district-level and market-basket reporting are 
not implemented, attention to the way NAEP information is provided 
would be useful. The types of NAEP reports are many and varied. The 
information serves many purposes for a broad constellation of audiences, 
including researchers, policy makers, the press, and the public. The more 
technical users as well as the lay public look to NAEP to support, refute, or 
inform their ideas about students’ academic accomplishments. The messages
taken from NAEP’s data displays can easily influence their perceptions
about the state of education in the United States. We therefore
recommend:

RECOMMENDATION: Appropriate user profiles and needs 
assessments should be considered as part of the integrated design 
of district-level and market-basket reports. The integration of 
usability as part of the overall design process is essential 
because it considers the information needs of the public. 

RECOMMENDATION: The text, graphs, and tables of
reports developed for market-basket and district-level reporting
should be subjected to standard usability engineering techniques,
including appropriate usability testing methodologies.
The purpose of such procedures would be to make reports
more comprehensible to their readers and more accessible to
their target audiences.

IMPLICATIONS FOR NAEP AND 
LOCAL EDUCATIONAL SYSTEMS 

The two proposed reporting practices would provide new information 
that would receive attention from new audiences that may not
previously have attended to NAEP results. The use of such information by
policy makers, state and local departments of education, the press, and the 
lay public could have a significant impact on NAEP and on state and local 
assessment, curriculum, and instruction. In addition, these reporting methods
pose challenges for NAEP’s current procedures, including item
development, sampling procedures, analytic and scoring methodologies, and
report preparation. 

NAEP has traditionally been a low-stakes assessment, but reporting 
results at a level closer to those responsible for instruction raises the stakes. 
With higher stakes comes the need to pay greater attention to security 
issues. In addition, motivation to do well may increase, which could affect 
the comparability of NAEP results across time and across jurisdictions, 
depending on how jurisdictions use the new results. 

Introducing new products and procedures to an already complex sys- 
tem has significant cost and resource implications. To construct short forms 
and to accommodate security considerations, item development would need 
to be stepped up. Sampling procedures would need to be altered and addi- 
tional students tested to support district-level results. Analytic methodolo- 
gies would need to be adapted. The types and numbers of reports to be 
produced would affect report preparation, possibly increasing the length of 
time to release results. These factors would require fundamental changes in 
NAEP’s processes, operations, and products. 

For local education systems, the reporting practices could increase the 
attention on NAEP results. Current assessments might be replaced or 
altered to accommodate NAEP’s schedule or to be modeled more closely
after the NAEP frameworks and item formats. There could be efforts to 
align instructional programs more closely with the NAEP frameworks. If 
NAEP were to report percent correct scores, states and districts might con- 
sider following suit for local assessments and change to a metric that may 
not lead to improved understanding of NAEP or local test results. It is not
clear that these changes would be beneficial to local education systems, and 
the implementation of these reporting approaches would require support 
systems to aid districts and states in appropriate uses and interpretations of 
the reported results. We therefore recommend: 

RECOMMENDATION: The potential is high for significant 
impact on curriculum and/or assessment at the local levels. If
either district-level reporting or market-basket reporting, with 
or without a short form, is planned for implementation, the 
program sponsors should develop and implement intensive 
support systems to assist districts and states in appropriate 
uses and interpretations of any such NAEP results reported. 

Introduction



Since 1969, the National Assessment of Educational Progress (NAEP) 
has been assessing students across the country (U.S. Department of Educa- 
tion, 1999). Since its inception, NAEP has summarized academic perfor- 
mance for the nation as a whole and, beginning in 1990, for the individual 
states. Reporting results below the state level was prohibited until 1994. 
The Improving America’s Schools Act of 1994, which reauthorized NAEP
in that year, removed the language prohibiting below-state reporting and 
set the stage for consideration of reporting district-level and school-level 
results. 

NAEP’s policy-making body believes “below state results could provide
an important source of data for informing a variety of education reform
efforts at the local level” (National Assessment Governing Board,
1995a). Some districts have expressed interest in district-level NAEP with 
an eye toward augmenting their current assessments, filling in gaps for content
areas not currently tested, or even substituting NAEP instruments for
those measures that have been locally developed or purchased (National
Research Council, 1999c). NAEP’s sponsors have also suggested that district-level
reports could increase motivation for districts’ participation in the
assessment by providing them with feedback on performance in return for 
their participation. 

At the same time, NAEP’s sponsors have taken a critical look at their
reporting methods with the objective of improving the usefulness and
interpretability of reports (National Assessment Governing Board, 1996;
National Assessment Governing Board, 1999a). NAEP’s sponsors have
attempted over the years to produce reports of achievement results that were
more usable by lay audiences and that contain more easily interpreted dis- 
plays of the information. NAEP has experimented with a variety of ap- 
proaches including, for example, reports that utilize a newspaper format, 
specific brochures of topical areas, and reports with easier-to-read graphs
and tables (U.S. Department of Education, 1999). They have funded stud- 
ies to examine the ways in which reports are used by policy makers, educa- 
tors, the press, and others and to identify misuses and misinterpretations of 
reported data (Hambleton & Slater, 1996; Jaeger, 1995; Hambleton & 
Meara, 2000). 

In addition, NAEP has attempted to design and introduce innovative 
research approaches to help with the interpretation of the data. In this
vein, advisers to the National Assessment Governing Board (NAGB) have
proposed the use of “market-basket” reporting methods as another means 
to accomplish simpler reporting that may be more useful to NAEP’s audi- 
ences (Forsyth, Hambleton, Linn, Mislevy, & Yen, 1996). Like the Con- 
sumer Price Index (CPI), which presents information on inflation by mea- 
suring price changes on a “market basket” of goods and services, a 
market-basket NAEP report would present information on student achieve- 
ment based on a “market basket” of knowledge and skills in a content area. 
Under one scenario, for example, NAEP would report results as percent- 
ages of items correct on sets of representative items, an approach to report- 
ing that could lead to easier-to-understand reports of student achievement. 
As part of their evaluation of NAEP, the National Research Council’s Com- 
mittee on the Evaluation of National and State Assessments of Educational 
Progress stressed the need for clear and comprehensible reporting metrics 
that would simplify the interpretation of results and encouraged explora- 
tion of market-basket reporting for NAEP (National Research Council, 
1999b). Market-basket reporting might be expected to provide an easier- 
to-understand picture of students’ academic accomplishments. 
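
The CPI analogy rests on pricing a fixed basket at two points in time. The following is a minimal sketch of that idea, assuming a simple fixed-quantity basket; the goods, prices, and quantities are hypothetical, and the actual CPI methodology is considerably more elaborate.

```python
def fixed_basket_index(base_prices, current_prices, quantities):
    """Cost of a fixed basket today relative to its base-period cost (base = 100)."""
    base_cost = sum(base_prices[good] * qty for good, qty in quantities.items())
    current_cost = sum(current_prices[good] * qty for good, qty in quantities.items())
    return 100.0 * current_cost / base_cost


# Hypothetical basket: quantities stay fixed while prices change.
quantities = {"bread": 10, "milk": 5}
base_prices = {"bread": 2.0, "milk": 1.0}
current_prices = {"bread": 2.5, "milk": 1.0}

index = fixed_basket_index(base_prices, current_prices, quantities)
print(f"Index: {index:.1f}")
```

Here the basket that cost 25.00 in the base period costs 30.00 in the current period, giving an index of 120. A market-basket NAEP report would replace goods and prices with a fixed set of items and student performance on them.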

In pursuit of the goals of improved reporting and use of test results, 
NAEP’s sponsors were interested in exploring the feasibility and potential 
impact of both district-level and market-basket reporting practices as well 
as the possible connections between them. Accordingly, at the request of 
the U.S. Department of Education, the National Research Council (NRC) 
established the Committee on NAEP Reporting Practices to study these 
reporting practices. Because these two topics are intertwined, the commit- 
tee is examining them in tandem. 
The committee developed two sets of study questions to address issues
associated with district-level and market-basket reporting. With regard to 
district-level reporting, the committee examined the following: 

1. What are the proposed characteristics of a district-level NAEP? 

2. If implemented, what information needs might it serve? 

3. What is the degree of interest in participating in district-level 
NAEP and what are the factors that would influence interest? 

4. Would district-level NAEP pose any threats to the validity of 
inferences from national and state NAEP? 

5. What are the implications of district-level reporting for other state
and local assessment programs? 

With respect to market-basket reporting, the committee investigated 
the following: 

1. What is market-basket reporting?

2. How might reports of market-basket results be presented to 
NAEP’s audiences? Are there prototypes?

3. What information needs might be served by market-basket 
reporting for NAEP? 

4. Are market-basket results likely to be relevant and accurate enough 
to meet these needs? 

5. Would market-basket reporting pose any threats to the validity of 
inferences from national and state NAEP? What types of infer- 
ences would be valid? 

6. What are the implications of market-basket reporting for other 
national, state, and local assessment programs? What role might 
a NAEP short form play?

In addressing these issues, the committee considered the future context 
in which NAEP may be operating. For instance, the National Center for 
Education Statistics (NCES) set a priority to have all states sign up for 
NAEP and secured participation agreements with 48 states for the assess- 
ment in 2000. For numerous reasons, however, several states were unable 
to successfully take part in the assessment. In two states, one large district 
refused to participate, making it impossible for each of these states to meet 
participation criteria. Similarly, other states were unable to secure partici- 
pation of enough schools to meet the threshold criteria. In fact, even some 
states that enacted legislation mandating state NAEP participation were 
unable to garner the necessary interest to meet the inclusion criteria 
(National Center for Education Statistics, 2000a). 

Tied to the increasing difficulty in securing participation for NAEP is 
the proliferation of assessment programs in general. Because of state educa- 
tion reforms and the requirements of federal education legislation (e.g., 
Improving America's Schools Act (IASA), Individuals with Disabilities Education 
Act (IDEA), and Carl Perkins Act), state assessment programs have 
expanded greatly in both scope and complexity in the past decade (Council 
of Chief State School Officers, 2000). Similarly, many local school dis- 
tricts, particularly the large urban school districts so important to state 
NAEP sampling strategies, have expanded the use of assessment instru- 
ments in their own testing programs (National Research Council, 1999c). 

Further, a potential factor in the changing context of NAEP is the 
proposal to make NAEP a more “high-stakes” measure by connecting re- 
wards and/or sanctions to states’ performance. For example, in its fiscal 
2001 budget, the Clinton administration proposed a “Recognition and 
Reward Program” that would provide “high performance bonuses to states 
that make exemplary progress in improving student performance and closing 
the achievement gap between high- and low-performing groups of students” 
(National Center for Education Statistics, 2000b:2). Although it is impossible 
to predict at the time of this writing whether this proposal will be enacted, 
it remains a distinct possibility. 

STUDY APPROACH 

To gather information on the issues surrounding market-basket and 
district-level reporting, the committee reviewed the literature on these two 
topics, invited representatives from NAEP's sponsoring agencies (NAGB 
and NCES) to attend meetings and present information, attended NAGB 
board and subcommittee meetings, held a discussion during the Large- 
Scale Assessment Conference sponsored by the Council of Chief State 
School Officers (CCSSO), and conducted two multiday workshops specifically 
on these two topics. The workshops addressed key issues from a variety of 
perspectives. The purpose of the NRC's Workshop on District-Level Reporting 
for NAEP was to explore with various stakeholders their interest in and 
perceptions regarding the likely impacts of district-level reporting. 
Similarly, the purpose of the NRC's Workshop on Market-Basket Reporting was 
to explore with various stakeholders their interest in and 
perceptions regarding the desirability, feasibility, and potential impact of 
market-basket reporting for NAEP. Chapter 3 provides additional details 
about the workshop on district-level reporting; additional information 
about the workshop on market-basket reporting is included in Chapters 
4 and 5. 



WHAT IS DISTRICT-LEVEL REPORTING? 

When first implemented, NAEP results were reported only for the 
nation as a whole. Following congressional authorization in 1988, the Trial 
State Assessment was initiated, which allowed reporting of results for 
participating states, although below-state reporting was still prohibited. The 
1994 reauthorization of NAEP removed this prohibition, but the law neither 
called for district- or school-level reporting nor outlined details 
about how such practices would operate. While NAGB and NCES have 
been exploring the issues associated with providing district-level results, the 
policies for district-level reporting are not yet in place, nor are the details to 
guide program implementation. 

WHAT IS MARKET-BASKET REPORTING? 

Market-basket reporting was first discussed in connection with NAEP's 
redesign in 1996 (National Assessment Governing Board, 1996) and was 
again included in the most recent redesign effort, NAEP Design 2000-2010 
(National Assessment Governing Board, 1999a). The market-basket 
concept is based on the idea that a limited set of items can represent some 
larger construct. The most common example of a market basket is the 
Consumer Price Index (CPI), produced by the Bureau of Labor Statistics. The 
CPI tracks changes in the prices paid by urban consumers in purchasing a 
locally representative set of consumer goods and services. The CPI measures 
monthly cost differentials for products in its market basket; therefore, the 
CPI is frequently used as an indicator of change in the U.S. economy. The 
CPI market-basket 
concept resonates with the general public; it invokes the tangible image of a 
shopper at the market filling a basket with a set of goods regarded as broadly 
reflecting consumer spending patterns at www.states.bls.gov (Bureau of 
Labor Statistics, 1999). 

The general idea of a NAEP market basket draws on a similar image: a 
collection of test questions representative of some larger content domain 
and an easily understood index to summarize performance on the items. 
There are two components of the NAEP market basket: the collection of 
items and the summary index. The collection of items could be large 
(longer than a typical test form given to a student) or small (small enough 
to be considered an administrable test form). The summary index currently 
under consideration is the percent correct score (Mazzeo, 2000). 

There are a number of configurations for a NAEP market basket. We 
discuss several in Chapter 4 of this report. To acquaint the reader with the 
basic ideas and issues associated with market-basket reporting, two alterna- 
tive scenarios are portrayed in Figure 1-1. 

Figure 1-1 presents a diagram of various components of the market 
basket and describes two alternate configurations. Under one scenario, a 
large collection of items would be assembled and released publicly. To 
adequately cover the breadth of the content domain, the collection would 
be much larger than any one of the forms used in the test and probably too 
long to administer to a single student at one sitting. This presents some 
challenges for the calculation of the percent correct scores. Because no 
student would take all of the items, complex statistical procedures would be 
needed for estimating scores. This alternative appears in Figure 1-1 as 
“scenario one.” 

A second scenario involves producing multiple “administrable” test 
forms (called “short forms”). Students would take an entire test form, and 
scores could be based on students’ performance for the entire test in the 
manner usually employed by testing programs. Although this would sim- 
plify calculation of percent correct scores, the collection of items would be 
much smaller and less likely to adequately represent the content domain. 
This scenario also calls for assembling multiple test forms. Some forms 
would be released to the public, while others would remain secure, perhaps 
for use by state and local assessment programs, and possibly to be embed- 
ded into or administered in conjunction with existing tests. This alterna- 
tive appears in Figure 1-1 as “scenario two.” 

ORGANIZATION OF THIS REPORT 

This report begins with an overview of NAEP in Chapter 2. Chapter 
3 is devoted to district-level reporting, and market-basket reporting is 
covered in Chapter 4. Because of the analogies that have been drawn 
between market-basket reporting and the CPI, we include discussion of the 
similarities and differences in Chapter 4; full details about construction 
and reporting of the CPI appear in Appendix A. The short form, which 
would be created under scenario two for the market basket, is addressed in 
Chapter 5. We believe that creation and administration of short-form 
NAEP would alter the fundamental purposes of NAEP, and we take up 
these issues of “changed NAEP” in that chapter. 

FIGURE 1-1 Components of the NAEP Market Basket 

NAEP's sponsors do not yet have prototypical models of either market-basket 
reports or district-level reports. During the course of our study, 
we reviewed a preliminary example of a market-basket report and a report 
provided to one district, but neither report was presented to us as a 
prototypic market-basket or district-level report. To get a better sense of 
the design and contents of such reports, we reviewed other current NAEP 
reports. In Chapter 6, we discuss ways NAEP's sponsors might formulate 
reports to ensure their usefulness, ease of understanding, and portrayal of 
meaningful information. A detailed example of an application of these 
procedures appears in Appendix B. 

Both market-basket and district-level reporting could potentially affect 
the internal configuration of the NAEP program, because they pose chal- 
lenges for sampling, scoring, and the number and types of reports to be 
prepared. For local school systems, reporting district-level results brings 
NAEP to a more intimate level of analysis. It is not too difficult to imagine 
district-level results included in accountability systems or put to other high- 
stakes uses, especially with the rewards that have been proposed (National 
Center for Education Statistics, 2000b). In Chapter 7, we present likely 
implications of the proposed reporting practices for NAEP and for local 
educational systems. 
Current NAEP 



This chapter begins with an overview of NAEP and highlights features 
of the current assessment program that bear on or may be affected by dis- 
trict-level and market-basket reporting practices. Later in the chapter, we 
address the issues and concerns about NAEP reports that prompted consid- 
eration of these two reporting methods. 

OVERVIEW OF NAEP 

As mandated by Congress in 1969, NAEP surveys the educational accomplishments 
of students in the United States. According to NAEP's 
sponsors, the program has two major goals: “to reflect current educational 
and assessment practices and to measure change reliably over time” (U.S. 
Department of Education, 1999:3). The assessment informs national- and 
state-level policy makers about student performances, and thus plays an 
integral role in evaluations of the conditions and progress of the nation's 
educational system. 

In addition, NAEP has proven to be a unique source of background 
information that has both informed and guided educational policy. Currently, 
NAEP includes two distinct assessment programs with different 
instrumentation, sampling, administration, and reporting practices, referred 
to as long-term trend NAEP and main NAEP (U.S. Department of Education, 
1999). 
Components of NAEP 

Long-term trend NAEP is a collection of test items in reading, writing, 
mathematics, and science that have been administered many times over the 
last three decades. As the name implies, trend NAEP is designed to docu- 
ment changes in academic performance over time. During the past decade, 
trend NAEP was administered in 1990, 1992, 1994, 1996, and 1999. 
Trend NAEP is administered to nationally representative samples of 9-, 
13-, and 17-year-olds (U.S. Department of Education, 1999). 

Main NAEP test items reflect current thinking about what students 
know and can do in the NAEP subject areas. They are based on recently 
developed content and skill outlines in reading, writing, mathematics, sci- 
ence, U.S. history, world history, geography, civics, the arts, and foreign 
languages. Main NAEP assessments use the latest advances in assessment 
methodology. Typically, two subjects are tested at each biennial administra- 
tion. Main NAEP has two components: national NAEP and state NAEP. 

National NAEP tests nationally representative samples of students in 
grades four, eight, and twelve. In most subjects, NAEP is administered two, 
three, or four times during a 12-year period, making it possible to track 
changes in performance over time. 

State NAEP assessments are administered to representative samples of 
students in states that elect to participate. State NAEP uses the same large- 
scale assessment materials as national NAEP. It is administered to grades 
four and eight in reading, writing, mathematics, and science (although not 
always in both grades in each of these subjects). 

ANALYTIC PROCEDURES 

NAEP differs fundamentally from other testing programs in that its 
objective is to obtain accurate measures of academic achievement for groups 
of students rather than for individuals. This goal is achieved using innova- 
tive sampling, scaling, and analytic procedures. 



Sampling of Students 

NAEP tests a relatively small proportion of the student population of 
interest using probability sampling methods. The national samples for main 
NAEP are selected using stratified multistage sampling designs with three 
stages of selection: districts, schools, and students. The result is a sample of 
about 150,000 students sampled from 2,000 schools. The sampling design 
for state NAEP has only two stages of selection, schools and students within 
schools, and samples approximately 3,000 students in 100 schools per state 
(roughly 100,000 students in 4,000 schools nationwide). The school and 
student sampling plan for trend NAEP is similar to the design for national 
NAEP. In 1996, between 3,500 and 5,500 students were tested in mathematics 
and science and between 4,500 and 5,500 were tested in reading 
and writing (Campbell, Voelkl, & Donahue, 1997). 
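
As a rough illustration of the multistage idea described above, the following Python sketch selects districts, then schools within the selected districts, then students within the selected schools. It is only a sketch under simplifying assumptions: the toy population, the function name `multistage_sample`, and the sample sizes are hypothetical, and simple random sampling is used at every stage, whereas the operational design also involves stratification, which is not reproduced here.

```python
import random


def multistage_sample(districts, n_districts, n_schools, n_students, seed=0):
    """Draw a three-stage sample: districts, then schools, then students.

    `districts` maps a district name to a dict of school name -> student ids.
    Simple random sampling is used at each stage for illustration only.
    """
    rng = random.Random(seed)
    sample = []
    # Stage 1: select districts.
    for d in rng.sample(sorted(districts), min(n_districts, len(districts))):
        schools = districts[d]
        # Stage 2: select schools within each selected district.
        for s in rng.sample(sorted(schools), min(n_schools, len(schools))):
            students = schools[s]
            # Stage 3: select students within each selected school.
            for stu in rng.sample(students, min(n_students, len(students))):
                sample.append((d, s, stu))
    return sample


# Hypothetical toy population: 10 districts, 5 schools each, 30 students each.
population = {
    f"district{d}": {
        f"school{d}-{s}": [f"student{d}-{s}-{i}" for i in range(30)]
        for s in range(5)
    }
    for d in range(10)
}

sample = multistage_sample(population, n_districts=4, n_schools=2, n_students=10)
print(len(sample))  # 4 districts * 2 schools * 10 students = 80
```

Dropping the first stage (sampling schools directly, then students) would correspond to the two-stage arrangement described for state NAEP.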

Sampling of Items 

NAEP assesses a cross section of the content within a subject-matter 
area. Due to the large number of content areas and sub-areas within those 
content areas, NAEP uses a matrix sampling design to assess students in 
each subject. Using this design, blocks of items drawn from each content 
domain are administered to groups of students, thereby making it possible 
to administer a large number and range of items while keeping individual 
testing time to one hour for all subjects. Consequently, students receive 
different but overlapping sets of NAEP items using a form of matrix sub- 
sampling known as balanced incomplete block spiraling. This design requires 
highly complicated analyses and does not permit the performance of a par- 
ticular student to be accurately measured. Therefore, NAEP reports only 
group-level results, and individual results are not provided. 
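
The idea of a balanced incomplete block design with spiraling can be sketched in a few lines of Python. The numbers used here (7 item blocks, 3 blocks per booklet) are hypothetical and chosen only because they give a small, well-known balanced design; they are not NAEP's actual block or booklet counts. In this sketch, no booklet contains all blocks (incomplete), every pair of blocks appears together in exactly one booklet (balanced), and booklets are handed out in rotating order (spiraling).

```python
from collections import Counter
from itertools import combinations

N_BLOCKS = 7      # hypothetical number of item blocks
BASE = (0, 1, 3)  # cyclic starter; every nonzero difference mod 7 occurs once


def bib_booklets():
    """Build 7 booklets of 3 blocks each by shifting BASE cyclically."""
    return [
        tuple(sorted((b + shift) % N_BLOCKS for b in BASE))
        for shift in range(N_BLOCKS)
    ]


def pair_counts(booklets):
    """Count how often each pair of blocks shares a booklet."""
    counts = Counter()
    for booklet in booklets:
        counts.update(combinations(booklet, 2))
    return counts


def spiral(booklets, n_students):
    """Hand out booklets in rotating order so each is used about equally."""
    return [booklets[i % len(booklets)] for i in range(n_students)]


booklets = bib_booklets()
counts = pair_counts(booklets)
# Balanced: all 21 possible pairs of blocks appear together exactly once.
print(len(counts), set(counts.values()))  # 21 {1}

# Spiraling: 14 hypothetical students receive each booklet exactly twice.
assignment = spiral(booklets, 14)
```

Because each student sees only 3 of the 7 blocks in this toy design, a student's responses cover only part of the domain, which is why results are summarized for groups rather than for individuals.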



Analytic Procedures 

Although individual results are not reported, it is possible to compute 
estimates of individuals’ performance on the overall assessment using com- 
plex statistical procedures. The observed data reflect student performance 
over the particular NAEP block the student actually took. Given that no 
individual takes all NAEP blocks, statistical estimation procedures must be 
used to derive estimates of individuals' proficiency on the full complement 
of skills and content covered by the assessment. The procedure involves 
combining samples of values drawn from distributions of possible profi- 
ciency estimates for each student. These individual student distributions 
are estimated from their responses to the test items and from background 
variables. The use of background variables in estimating proficiency is called 
conditioning. 

For each student, five values, called plausible values, are randomly 
drawn from the student's distribution of possible proficiency estimates. Five 
plausible values are drawn to reflect the uncertainty in a student's proficiency 
estimate, given the limited set of test questions administered to each 
student. The sampling from the student's distribution is an application of 
Rubin's (1987) multiple imputation method for handling missing data (the 
responses to items not presented to the student are considered missing). In 
the NAEP context this process is called plausible values methodology 
(National Research Council, 1999b). 
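
The following Python sketch illustrates the general logic of drawing five plausible values per student and combining the five resulting group estimates with the standard multiple-imputation combining rules. It rests on assumptions that are not taken from NAEP's operational procedures: each student's distribution of possible proficiency is represented here as a normal distribution with a hypothetical mean and standard deviation, the function names are invented for illustration, and the within-imputation variance treats students as a simple random sample.

```python
import random
import statistics

M = 5  # number of plausible values drawn per student


def draw_plausible_values(posteriors, m=M, seed=0):
    """Draw m values from each student's (mean, sd) normal distribution."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sd) for _ in range(m)] for mu, sd in posteriors]


def combine(pv_table):
    """Combine the m group means using multiple-imputation combining rules."""
    n = len(pv_table)
    m = len(pv_table[0])
    estimates, variances = [], []
    for k in range(m):
        column = [row[k] for row in pv_table]
        estimates.append(statistics.mean(column))
        # Sampling variance of the mean, assuming simple random sampling.
        variances.append(statistics.variance(column) / n)
    point = statistics.mean(estimates)        # final group estimate
    within = statistics.mean(variances)       # average sampling variance
    between = statistics.variance(estimates)  # variance across the m sets
    total = within + (1 + 1 / m) * between
    return point, total


# Hypothetical group of 200 students with identical posterior distributions.
posteriors = [(250.0, 20.0)] * 200
pvs = draw_plausible_values(posteriors)
point, total = combine(pvs)
print(round(point, 1), round(total, 3))
```

The between-set term is what carries the extra uncertainty that arises because each student answered only a limited set of questions; analyzing a single set of values as if it were observed scores would omit it.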

The conditioning process derives performance distributions for each 
student using information about performance of other students with simi- 
lar background characteristics. That is, performance estimates are based on 
the assumption that a student's performance is likely to be similar to that of 
other students with similar backgrounds. Conditioning is performed dif- 
ferently for national and state NAEP. For national NAEP, it is based on the 
relationship between background variables and performance on test items 
for the national sample. For state NAEP, conditioning is based on the 
relationship between the background variables and item performance for 
each state; these relationships may not be the same for the different state 
samples. As a result, the estimated distributions of proficiency for two indi- 
viduals with similar background characteristics and item responses may 
differ if the individuals are from different states. 



REPORTING NAEP RESULTS 

Statistics Reported 

NAEP’s current practice is to report student performance on the as- 
sessments using a scale that ranges from 0 to 500. Scale scores summarize 
performance in a given subject area for the nation as a whole, for individual 
states, and for subsets of the population based on demographic and back- 
ground characteristics. Results are tabulated over time to provide trend 
information. 

In addition, NAEP reports performance using performance standards, 
or achievement levels. The percentage of students at or above each achieve- 
ment level is reported. NAGB has established, by policy, definitions for 
three levels of student achievement: basic, proficient, and advanced (U.S. 
Department of Education, 1999). The achievement levels describe the 
range of performance NAGB believes should be demonstrated at each 
grade. NAGB’s definitions for each level are as follows (U.S. Department 
of Education, 1999:29): 

• Basic: partial mastery of prerequisite knowledge and skills that are 
fundamental for proficient work at each grade. 

• Proficient: solid academic performance for each grade assessed. Stu- 
dents reaching this level have demonstrated competency over chal- 
lenging subject matter, including subject-matter knowledge, 
application of such knowledge to real-world situations, and ana- 
lytical skills appropriate to the subject matter. 

• Advanced: superior performance. 

NAEP also collects a variety of demographic, background, and contextual 
information on students, teachers, and administrators. Student demographic 
information includes characteristics such as race/ethnicity, gender, 
and highest level of parental education. Contextual and environmental 
data provide information about students’ course selection, homework hab- 
its, use of textbooks and computers, and communication with parents about 
schoolwork. Information obtained about teachers includes the training 
they received, the number of years they have taught, and the instructional 
practices they employ. Administrators also respond to questions about their 
schools, including the location and type of school, school enrollment num- 
bers, and levels of parental involvement. NAEP summarizes achievement 
results by these various characteristics. 



Types of Reports 

NAEP produces a variety of reports, each targeted to a specific audi- 
ence. According to NCES, targeting each report to a segment of the audi- 
ence increases its impact and appeal (U.S. Department of Education, 1999). 
Table 2-1 below lists the various types of NAEP reports along with the 
targeted audience and general purpose for each type of report. 

TABLE 2-1 

Type of Report          Targeted Audience             Purpose/Contents

NAEP Report Cards       Policy makers                 Present results for all test
                                                      takers and for various
                                                      population groups.

Highlights Reports      Parents, school board         Answer frequently asked
                        members, general public       questions in non-technical
                                                      manner.

Instructional Reports   Educators, school             Include many of the
                        administrators, and           educational and instructional
                        subject-matter experts        material available from the
                                                      NAEP assessments.

State Reports           Policy makers, Department     Present results for all test
                        of Education officials,       takers and various population
                        chief state school officers   groups for each state.

Cross-State Data        Researchers and state         Serve as reference documents
Compendia               testing directors             that accompany other reports
                                                      and present state-by-state
                                                      results for variables included
                                                      in the state reports.

Trend Reports           [Not specified]               Describe patterns and changes
                                                      in student achievement as
                                                      measured by the long-term
                                                      trend assessments.

Focused Reports         Educators, policy makers,     Explore in-depth questions
                        psychometricians, and         with broad educational
                        interested citizens           implications.

Summary Data Tables     [Not specified]               Present extensive tabular
                                                      summaries based on background
                                                      data from student, teacher,
                                                      and school questionnaires.

Technical Reports       Educational researchers,      Document details of the
                        psychometricians, and         assessment, including sample
                        other technical audiences     design, instrument
                                                      development, data collection
                                                      process, and analytic
                                                      procedures.

Uses of NAEP Reports 

The Committee on the Evaluation of National and State Assessments 
of Educational Progress conducted an analysis of the uses of the 1996 NAEP 
mathematics and science results. The analysis considered reports of NAEP 
results in the popular and professional press, NAEP publications, and various 
letters, memoranda, and other unpublished documents. They found 
that NAEP results were used to (National Research Council, 1999b:27): 

1. describe the status of the educational system, 

2. describe student performance by demographic group, 

3. identify the knowledge and skills over which students have (or do 
not have) mastery, 

4. support judgments about the adequacy of observed performance, 

5. argue the success or failure of instructional content and strategies, 

6. discuss relationships among achievement and school and family 
variables, 

7. reinforce the call for high academic standards and educational 
reform, and 

8. argue for system and school accountability. 

These findings are similar to those cited by McDonnell (1994). 



Redesigning NAEP Reports 

The diverse audiences and uses for NAEP reports have long posed 
challenges for the assessment (e.g., Koretz and Deibert, 1995/1996). 
Concerns about appropriate uses and potential misinterpretations were 
heightened by the media’s reporting on the results of the first Trial State 
Assessment (Jaeger, 1998). One of the most widespread interpretation 
problems was the media translation of mean NAEP scores into state 
rankings. Many newspapers simply ranked states according to average 
scores, notwithstanding the fact that differences among state scores were 
not statistically reliable. 

In addition, there have been misinterpretations associated with report- 
ing of achievement-level results. The method of reporting the percentage 
of students at or above each achievement level has been found to cause 
confusion (Hambleton & Slater, 1995). Because students at or above the 
advanced level are also counted as at or above the basic and proficient 
levels, and students at or above proficient are also counted as at or above 
basic, the percentages of students at or above all three levels can add up to 
more than 100 percent. This is confusing to users. The mental arithmetic that is required 
to determine the percentage that scored at a specific achievement level is 
difficult for many users of NAEP data. Other studies have cited difficulties 
associated with interpreting standard errors, significance levels, and other 
statistical jargon included in NAEP reports (Jaeger, 1996; Hambleton & 
Slater, 1995). 
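
The arithmetic needed to move from "at or above" percentages to the percentage at each specific level is a series of subtractions, which the short Python sketch below makes explicit. The function name and the input percentages are hypothetical and are used only to illustrate the calculation; they are not actual NAEP results.

```python
def percent_at_each_level(at_or_above_basic, at_or_above_proficient,
                          at_or_above_advanced):
    """Convert cumulative "at or above" percentages to per-level percentages."""
    return {
        "below basic": 100 - at_or_above_basic,
        "basic": at_or_above_basic - at_or_above_proficient,
        "proficient": at_or_above_proficient - at_or_above_advanced,
        "advanced": at_or_above_advanced,
    }


# Hypothetical cumulative percentages: 70 + 30 + 5 = 105, more than 100.
levels = percent_at_each_level(70, 30, 5)
print(levels)
# {'below basic': 30, 'basic': 40, 'proficient': 25, 'advanced': 5}
print(sum(levels.values()))  # 100
```

The cumulative figures overlap because each higher level is counted again within every lower level, whereas the per-level figures partition the students and therefore sum to 100.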

NAEP's sponsors have sought ways to improve its reports. The 1996 
redesign of NAEP described the concept of market-basket reporting as one 
means for making reports more meaningful and understandable (National 
Assessment Governing Board, 1996). The authors of the document rea- 
soned that public release of the market basket of items would give users a 
concrete reference for the meaning of the scores. This method would also 
have the advantage of being more comfortable to users who are “familiar 
with only traditional test scores,” such as those reported as percents correct 
(Forsyth et al., 1996:6-26). 

The most recent design plan, Design 2000-2010 (National Assessment 
Governing Board, 1999a), again addressed reporting issues. Authors of 
the document set forth the objective of defining the audience for NAEP 
reports. They distinguished among NAEP's audiences by pointing out that 
the primary audience is the U.S. public, while the primary users of its data 
have been national and state policy makers, educators, and researchers. The 
document stated (National Assessment Governing Board, 1999a:10): 

[NAEP reports] should be written for the American public as the primary 
audience and should be understandable, free of jargon, easy to use and widely 
disseminated. National Assessment reports should be of high technical qual- 
ity, with no erosion of reliability, validity, or accuracy. 

The amount of detail in reporting should be varied. Comprehensive reports 
would be prepared to provide an in-depth look at a subject, using a newly 
adopted test framework, many students, many test questions, and ample 
background information. Results would be reported using achievement 
levels. Data also would be reported by sex, race-ethnicity, socio-economic 
status (SES), and for public and private schools. Standard reports would 
provide overall results in a subject with achievement levels and average scores. 

Data could be reported by sex, race/ethnicity, SES, and for public and private 
schools, but would not be broken down further. Special, focused assessments 
on timely topics also would be conducted, exploring a particular question or 
issue and possibly limited to one or two grades. 



SUMMARY AND RECOMMENDATIONS 

NAEP serves a diverse audience with varied interests and needs. Com- 
municating assessment results to such a broad audience presents unique 
challenges. The breadth of the audiences combined with their differing 
needs and uses for the data make effective communication particularly difficult. 
The Committee on NAEP Reporting Practices views market-basket 
and district-level reporting as falling within the context of making NAEP 
results more useful and meaningful to a variety of audiences. These are 
important goals that deserve focused attention. 

RECOMMENDATION 2-1: We support the efforts thus far 
on the design of NAEP reports and encourage NAEP's sponsors 
to continue to find ways to report NAEP results in ways 
that engage the public and enhance their understanding of 
student achievement in the United States. 
Reporting District-Level NAEP Results 



The Improving America's Schools Act of 1994, which reauthorized 
NAEP in that year, eliminated the prohibition against reporting NAEP 
results below the state level. Although the law removed the prohibition, it 
neither called for district- or school-level reporting, nor did it outline de- 
tails about how such practices would operate. NAGB and NCES have 
explored reporting district-level results as a mechanism for providing more 
useful and meaningful NAEP data to local policy makers and educators. 
They have twice experimented with trial district-level reporting programs. 
For a variety of reasons, neither attempt revealed much interest on the part 
of school districts. The lack of interest was attributable, in part, to financial 
considerations and to unclear policy about whether the state or the district 
had the ultimate authority to make participation decisions. Despite the 
apparent lack of interest during the attempted trial programs, there is some 
evidence that provision of district-level results could be a key incentive to 
increasing schools’ and districts’ motivation to participate in NAEP 
(Ambach, 2000). 

The focus of the committee’s work on district-level reporting was to 
evaluate the desirability, feasibility, potential uses, and likely impacts of 
providing district-level NAEP results. In this chapter, we address the follow- 
ing questions: (1) What are the proposed characteristics of a district-level 
NAEP? (2) If implemented, what information needs might it serve? 
(3) What is the degree of interest in participating in district-level NAEP? 
(4) What factors would influence interest? 
STUDY APPROACH 

To gather information relevant to these questions, the committee re- 
viewed the literature that has been written about below-state reporting, 
including NCES and NAGB policy guidelines for district-level reporting 
(National Assessment Governing Board, 1995a; National Assessment Gov- 
erning Board, 1995b; National Center for Education Statistics, 1995); lis- 
tened to presentations by representatives from NAGB, NCES, and their 
contractors (ETS and Westat) regarding district-level reporting; and held a 
workshop on district-level reporting. During the workshop, representatives 
of NAGB and NCES discussed policy guidelines, prior experiences, and 
future plans for providing district-level data. Representatives from ETS 
and Westat spoke about the technical issues associated with reporting dis- 
trict-level data. Individuals representing state and district assessment offices 
participated and commented on their interest in and potential uses for 
district-level results. Representatives from national organizations (Council 
of Chief State School Officers and Council of Great City Schools) and 
authors of papers on providing below-state NAEP results served as discus- 
sants at the workshop. Approximately 40 individuals participated in the 
workshop. Workshop proceedings were summarized and published 
(National Research Council, 1999c). 

This chapter begins with a review of the concerns expressed when state 
NAEP was first implemented, as they could all relate to below-state report- 
ing. This section contains a description of the evaluations of the Trial State 
Assessment, the findings of the evaluations, and the reported benefits of 
state NAEP. The chapter continues with a summary of the chief issues 
raised by authors who have explored the advantages and disadvantages of 
providing below-state results. In the next portion of this chapter, the two 
experiences with district-level reporting are described. The first of these 
experiences is associated with the 1996 assessment, and the other is associated 
with the 1998 assessment. A summary of the information obtained 
during the committee's workshop on district-level reporting is presented in 
the final portion of this chapter. 

INITIAL CONCERNS FOR STATE-LEVEL REPORTING 

Prior to implementation of the Trial State Assessment (TSA) and re- 
porting of state-level results, researchers and others familiar with NAEP 
expressed concerns about the expansion of the assessment to include state-level
data. These concerns centered around the anticipated uses of state-level
data and the likely effects on curriculum and instruction. National
NAEP had been a low-stakes examination, since data could not be used for 
decisions at the state, district, school, or classroom level. National-level 
data were not being used for accountability purposes, and participants were 
relatively unaffected by the results. With the provision of state-level results, 
some expressed concern that the stakes associated with NAEP could rise. 

Specifically, observers questioned if the reporting of the TSA would 
cause local districts and states to change the curriculum or instruction that 
is provided to students. They also questioned if local or state testing pro- 
grams would change to accommodate NAEP-tested skills or would simply 
be pushed aside. Observers also debated whether any changes in curricu- 
lum or assessment would be positive or counterproductive (Stancavage, 
Roeber, & Bohrnstedt, 1992:261). 

These questions stemmed from concerns about the emphases given 
NAEP results. As long as NAEP was a low-stakes test and decisions did not 
rest on the results, it was unlikely that states and districts would adjust their 
curriculum or assessments based on the results. But reporting results at the 
state level could increase pressure on states to change their instructional 
practices, which could threaten the validity of NAEP scores (Koretz 
1991:21). Furthermore, Koretz warned that changes in instructional prac- 
tices could harm student learning. To the degree that NAEP frameworks 
represent the full domain of material students should know, planning 
instruction around the frameworks may be appropriate. However, if schools 
“teach to the test,” meaning that they teach only a narrow domain covered 
by the assessment, then they have inappropriately narrowed the curriculum. 

Beaton (1992:14) used the term “boosterism” to describe the activities 
that might be used to motivate students to do their best for the “state’s
honor.” He suggested that boosterism combined with teaching to the test
and “more or less subtle ways of producing higher scores” could affect the 
comparability of state trend data, if these practices change or become more 
effective over time. 

Others questioned how the results might be interpreted. For instance, 
Haertel (1991:436) pointed out that the first sorts of questions asked will 
pertain to which states have the best educational systems but cautioned 
that attempts to answer would be “fraught with perils.” Haertel continued
(p. 437):

[Comparisons] will involve generalizations from TSA exercise pools to a
broader range of learning outcomes . . . [Such comparisons] depend on the 
match between NAEP content and states’ own curriculum framework . . . 

For example, a state pressing to implement the [National Council of Teachers 
of Mathematics] framework might experience a (possibly temporary) decrease 
in performance on conventional mathematics problems due to its deliberate 
decision to allocate decreased instruction time to that type of problem. The 
1990 TSA might support the (valid) inference that the state’s performance on 
that type of problem was lagging, but not the (invalid) inference that their 
overall mathematics performance was lagging. 

Haertel (1991) also expected that state-to-state comparisons would 
prompt the press and others to rank states, based on small (even trivial) 
differences in performance. In fact, Stancavage et al. (1992) reported that 
in spite of cautions by NCES and Secretary of Education Lamar Alexander 
not to rank states, four of the most influential newspapers in the nation did 
so. In a review of 55 articles published in the top 50 newspapers, they 
found that state rankings were mentioned in about two-thirds of the articles 
(Stancavage et al., 1992). 

Other concerns pertained to the types of inferences that NAEP’s various
audiences might draw based on the background, environmental, and
contextual data that are reported. These data provide a wealth of information
on factors that relate to student achievement. However, the data collection
design does not support inferences that these factors caused the
level of achievement students attained nor does it meet the needs of ac- 
countability purposes. The design is cross sectional in nature, assessing 
different samples of students on each testing occasion. Such a design does 
not allow for the before-and-after data required to hold educators respon- 
sible for results. Furthermore, correlations of student achievement on 
NAEP with data about instructional practices obtained from the back- 
ground information do not imply causal relationships. For example, the 
1994 NAEP reading results showed that fourth-grade students who re- 
ceived more than 90 minutes of reading instruction a day actually per- 
formed worse than students receiving less instruction. Clearly, the low- 
performing students received more hours of instruction as a result of their 
deficiencies; the extra instruction did not cause the deficiencies (Glaser, 
Linn, & Bohrnstedt, 1997). 

Benefits Associated with State NAEP 

Despite these concerns about the provision of state-level data, reviews 
of the TSA have cited numerous benefits and positive impacts of the program.
Feedback from state assessment officials indicated that state NAEP
has had positive influences on instruction and assessment (Stancavage et 
al., 1992; Stancavage, Roeber, & Bohrnstedt, 1993; Hartka & Stancavage, 
1994; DeVito, 1997). When the TSA was first implemented, many states 
were in the process of revamping their frameworks and assessments in both 
reading and mathematics. According to state officials, in states where 
changes were under way, the TSA served to validate the changes being 
implemented; in states contemplating changes, the TSA served as an 
impetus for change. 

Respondents to surveys conducted by Stancavage and colleagues 
(Hartka & Stancavage, 1994) reported that the following changes in reading
assessment and instruction were taking place: increased emphasis on
higher-order thinking skills; better alignment with current research on read- 
ing; development of standards-based curricula; increased emphasis on lit- 
erature; and better integration or alignment of assessment and instruction. 
Although these changes could not be directly attributed to the implemen- 
tation of the TSA, they reflected priorities also set for the NAEP reading 
assessment. In addition, many state assessment measures were expanded to 
include more open-ended response items, with an increased emphasis on 
the use of authentic texts and passages, like those found on NAEP (Hartka 
& Stancavage, 1994). 

At the time of the first TSA, the new mathematics standards published 
by the National Council of Teachers of Mathematics (NCTM) were having 
profound effects on mathematics curricula, instructional practice, and 
assessment throughout the country (Hartka & Stancavage, 1994). Survey 
results indicated that changes similar to those seen for reading were occur- 
ring in mathematics instruction and assessment: alignment with the 
NCTM standards, increased emphasis on higher-order thinking skills and
problem solving, development of standards-based curricula, and integration
or alignment of assessment and instruction (Hartka & Stancavage,
1994). The mathematics TSA was also influential in “tipping the balance in
favor of calculators (in the classroom and on assessments) and using sample 
items [for] teacher in-service training” (Hartka & Stancavage, 1994:431). 
Again, although these changes could not be attributed to the TSA, the 
NAEP mathematics frameworks’ alignment with the NCTM standards 
served to reinforce the value of the professional standards. 

Results from the first TSA in 1990 garnered much attention
from the media and the general public. For states with unsatisfactory performance,
TSA results were helpful in spurring reform efforts. For states
with satisfactory TSA performance, state officials could attribute the results 
to the recent reforms in their instructional practice and assessment mea- 
sures. 



LITERATURE ON BELOW-STATE REPORTING

In “The Case for District- and School-Level Results from NAEP,” 
Selden (1991) made the seemingly self-evident argument that having information
is better than not having it, saying (p. 348), “most of the time,
information is useful, and the more of it we have, the better, as long as the 
information is organized and presented in a way that [makes] it useful.” 
Selden claimed that because NAEP is conducted and administered simi- 
larly across sites (schools), it offers comparable information from site to 
site, thus allowing state-to-state or district-to-district comparisons. He found
that NAEP’s ability to collect high-quality data comparably over time and
across sites lends it to powerful uses for tracking both student achievement 
and background information. According to Selden, questions that might 
be addressed by trend data include: are instructional practices changing in 
the desired directions; are the characteristics of the teacher workforce get- 
ting better; and are home reading practices improving. He explained that 
schools and districts could use trend information to examine their students' 
achievement in relation to instructional methods. 

While Selden presented arguments in favor of providing below-state- 
level results, he and others (Haney and Madaus, 1991; Beaton, 1992; 
Roeber, 1994) also cautioned that reporting results below the state level 
could lead to a host of problems and misuses. Their arguments emphasized 
that, although having more information could be viewed as better than 
having less information, it is naive to ignore the uses that might be made of 
the data. Indeed, Selden (1991:348) pointed out that one fear is that new 
information will be “misinterpreted, misused, or that unfortunate, unfore- 
seen behavior will result from it.” Reports of below-state NAEP results 
could easily become subject to inappropriate high-stakes uses. For example, 
results could be used for putting districts or schools into receivership; mak- 
ing interdistrict and interschool comparisons; using results in school choice 
plans; holding teachers accountable; and allocating resources on the basis 
of results (Haney and Madaus, 1991). In addition, some authors worried 
that NAEP’s use as a high-stakes accountability device at the local level
could lead to teaching to the test and distortion of the curriculum (Selden,
1991, Beaton, 1992). Selden (1991) further argued that the use of NAEP
results at the district or school level has the potential to discourage states 
and districts from being innovative in developing their own assessments. 

Potential high-stakes uses of NAEP would heighten the need for secu- 
rity. Item development would need to be stepped up, which would raise 
costs (Selden, 1991). NAGB, NCES, the NAEP contractors, and participating
school district staff would also have to coordinate efforts to ensure
that the NAEP assessments are administered in an appropriate manner. 
According to Roeber (1994:42), such overt action would be needed “to 
assure that reporting does not distort instruction nor negatively impact the 
validity of the NAEP results now reported at the state and national levels.” 

EXPERIENCES WITH DISTRICT-LEVEL REPORTING 

NAGB and NCES supported the initiative to provide district-level re- 
sults, hoping that school districts would choose to use NAEP data to in- 
form a variety of education reform initiatives at the local level (National 
Assessment Governing Board, 1995a; National Assessment Governing 
Board, 1995b). With the lifting of the prohibition against below-state re- 
porting, NAGB and NCES explored two different procedures for offering 
district-level NAEP data to districts and states: the Trial District Assess- 
ment, offered in 1996, and the Naturally-Occurring District Plan, offered 
in 1998. 



The 1996 Experience: Trial District Assessment 

Under the Trial District Assessment, large school districts were offered 
three options for participating in district-level reporting of NAEP (National
Center for Education Statistics, 1995). The first option, “Augmentation
of State NAEP Assessment,” offered district-level results in the same
subjects and grades as in state NAEP by augmenting the district’s portion 
of the state NAEP sample. Under this option, districts would add “a few 
schools and students” to their already selected sample in order to report 
stable estimates of performance at the district level. According to the 
NCES, the procedures for augmenting the sample would “minimize the 
cost of the assessment process,” and costs were to be paid by the district. 

The second option in 1996, “Augmentation of National Assessment,” 
would allow for reporting district results in subjects and grades adminis- 
tered as part of national NAEP by augmenting the number of schools 
selected within certain districts as part of the national sample. Because few
schools are selected in any single district for national NAEP, this second 
option would require most school districts to select “full samples of schools” 
(National Center for Education Statistics, 1995:2) to meet the sampling 
requirements and to report meaningful results. The costs of augmenting
the national sample for participating districts would be more substantial
than those associated with augmenting the state sample. If a district se- 
lected either of these options, the procedures for sample selection, adminis- 
tration, scoring, analysis, and reporting would follow those established for 
national or state NAEP, depending on the option selected. And the results 
would be “NAEP comparable or equivalent.” 

The third option in 1996, “Research and Development,” was offered 
to districts that might not desire NAEP-comparable or equivalent results 
but that had alternative ideas for using NAEP items. For example, districts 
might assess a subject or subjects not assessed by NAEP at the national or 
state level; they might want to administer only a portion of the NAEP 
instrument; or they might choose to deviate from standard NAEP proce- 
dures. NCES would regard such uses as research and development activi- 
ties and would not certify the results obtained under this option as NAEP 
comparable or equivalent. 

Prior to the 1996 administrations, NCES (with the assistance of the 
sampling contractor, Westat) determined that the minimum sampling re- 
quirements for analysis and reporting at the district level were 25 schools 
and 500 assessed students per grade and subject. To gauge interest in the 
plan, NCES and ETS sponsored a meeting during the 1995 annual meet- 
ing of the American Educational Research Association, inviting representa- 
tives from several of the larger districts in the country. Based on this meet- 
ing and further interaction with district representatives, NCES identified 
approximately 10 school systems interested in obtaining their NAEP re- 
sults. NCES and their contractors held discussions with representatives of 
these districts. The costs turned out to be much higher than school systems 
could easily absorb (National Research Council, 1999c). Consequently, 
only Milwaukee participated in 1996, with financial assistance from the 
National Science Foundation. Additional sampling of schools and stu- 
dents was required for Milwaukee to reach the minimum numbers neces- 
sary for participation, and they received results only for grade eight. 

Milwaukee’s Experience under the Trial District Assessment

In the spring of 1996, NAEP was administered to a sample of 
Milwaukee’s school population, and results were received in September 
1997. NCES prepared a special report for the school district summarizing 
performance overall and by demographic, environmental, background, and 
academic characteristics. Explanatory text accompanied the tabular reports. 

Paul Cieslak, former research specialist with the Milwaukee school district,
attended the committee’s workshop and described the uses made of
the reported data. According to Cieslak, the report was primarily used as
part of a day-long training session with 45 math/science resource teachers,
under the district’s NSF Urban Systemic Mathematics/Science Initiative to
help the teachers work with project schools (Cieslak, 2000). The teachers 
found the overall performance and demographic information moderately 
helpful. The reports summarizing performance by teaching practices and 
by background variables and institutional practices were more useful and 
interesting. Milwaukee officials found that the NAEP results generally 
supported the types of instructional practices they had been encouraging. 

According to Cieslak (2000), the School Environmental data “in- 
creased the value of the NAEP reports tenfold” since districts do not have 
the time or the resources to collect these data. This information helped 
school officials to look at relationships among classroom variables and per- 
formance. Cieslak believed that availability of the School Environmental 
data could be one of the strongest motivating factors behind districts’
interest in participation. 

While no specific decisions were based on the data, Cieslak believed 
that was primarily because so much attention is focused on their state and 
local assessments, especially those included in the district’s accountability
plan. In Milwaukee, the various assessment programs compete for attention,
and the statewide assessments usually win out. Cieslak believed that
state assessments will continue to receive most of the attention unless some 
strategies are implemented to demonstrate specifically how NAEP data are 
related to national standards, specific math/science concepts, or district 
goals. 



The 1998 Experience: Naturally Occurring Districts 

Prior to the 1998 NAEP administration, NCES and Westat deter- 
mined that there were six “naturally occurring districts” in state samples. 
They defined naturally occurring districts as those that comprise at least 20
percent of the state’s sample and that meet the minimum sampling requirements
for analysis and reporting at the district level (25 schools and 500
assessed students per grade and subject). These districts can be thought of 
as “self-representing in state NAEP samples” (Rust, 1999). The districts 
that met these guidelines in 1998 were Albuquerque, New Mexico; 
Anchorage, Alaska; Chicago, Illinois; Christiana County, Delaware; Clark 
County, Nevada; and New York City, New York. 
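
The definition above can be read as a simple eligibility check. The following Python sketch is illustrative only: the function name and the example counts are hypothetical, and it assumes that a district's share of the state sample is measured in assessed students, which the text does not specify.

```python
# Thresholds are taken from the text; everything else is hypothetical.
MIN_SHARE_OF_STATE_SAMPLE = 0.20
MIN_SCHOOLS = 25
MIN_ASSESSED_STUDENTS = 500


def is_naturally_occurring(district_students, district_schools, state_students):
    """Return True if a district meets all three criteria.

    Assumption: the district's share of the state sample is computed
    from counts of assessed students per grade and subject.
    """
    share = district_students / state_students
    return (
        share >= MIN_SHARE_OF_STATE_SAMPLE
        and district_schools >= MIN_SCHOOLS
        and district_students >= MIN_ASSESSED_STUDENTS
    )


# Hypothetical counts, not actual NAEP figures.
large_district = is_naturally_occurring(600, 28, 2500)  # share = 0.24
small_district = is_naturally_occurring(450, 30, 2500)  # share = 0.18
print(large_district, small_district)
```

In this sketch the second district fails on two counts: it falls below 20 percent of the state sample and below 500 assessed students, even though it has enough schools.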

In July 1998, NCES contacted district representatives to assess their 
interest in receiving district-level NAEP results at no additional cost. They 
found no takers. Alaska did not participate in 1998, and Christiana County 
expressed no interest. District representatives in New York City and Chi- 
cago did not want the data. Gradually, the idea of providing district-level 
reports grew increasingly controversial. The NAEP State Network, which 
consists of state assessment directors or their appointed representatives, 
voiced concerns about the fairness of making the data available for some 
districts but not others. NCES did not query Clark County or Albuquer- 
que, or their respective states, as to their interest, since by then the idea of 
district-level reporting was being questioned (Arnold Goldstein, National 
Center for Education Statistics, personal communication, October 1999). 

Controversy arose concerning who would make participation and re- 
lease decisions for a district-level NAEP. Although New York and Chicago 
did not want the data, their respective states did, thereby creating a con- 
flict. NAGB discussed the issue at its August 1999 meeting and decided 
that no further offers of district results should be made until it was clear 
who should be the deciding entity (National Assessment Governing Board, 
1999d). 



TECHNICAL AND POLICY CONSIDERATIONS FOR 
DISTRICT-LEVEL REPORTING 

As part of the workshop on district-level reporting, the committee 
asked representatives from NAGB, NCES, ETS, and Westat to discuss the 
technical issues related to sampling and scoring methodologies and the 
policy issues related to participation and reporting decisions. The text 
below summarizes the information provided by NAEP’s sponsors and contractors.

Proposed Sampling Design for Districts

In preparation for the workshop, NCES and Westat provided two
documents that outlined the proposed sampling plans for district-level re- 
porting (Rust, 1999; National Center for Education Statistics, 1995). For 
state NAEP, the sample design involves two-stage stratified samples. Schools 
are selected at the first stage, and students are selected at the second stage. 
The typical state sample size is 3,000 students per grade and subject, with 
30 students per school. The sample sizes desired for district results would 
be roughly one-quarter that required for states (750 sampled students at 25 
schools, to yield 500 participants at 25 schools). This sample size would be 
expected to produce standard errors for districts that are about twice the 
size of standard errors for the state. 
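
The arithmetic behind these figures can be checked with a short calculation. The Python sketch below is a rough illustration, not the procedure used by NCES or Westat: it assumes that standard errors shrink in proportion to the square root of the sample size, as they would under simple random sampling, and it ignores the design effects of a two-stage stratified sample. The variable and function names are hypothetical.

```python
import math

STATE_SAMPLE = 3000          # typical state sample per grade and subject
STUDENTS_PER_SCHOOL = 30     # students sampled in each school
DISTRICT_SAMPLE = 750        # students sampled in a district
DISTRICT_PARTICIPANTS = 500  # participants the district sample should yield


def standard_error_ratio(state_n, district_n):
    """Ratio of district to state standard error, assuming SE ~ 1/sqrt(n)."""
    return math.sqrt(state_n / district_n)


state_schools = STATE_SAMPLE // STUDENTS_PER_SCHOOL
sample_fraction = DISTRICT_SAMPLE / STATE_SAMPLE
se_ratio = standard_error_ratio(STATE_SAMPLE, DISTRICT_SAMPLE)
expected_yield = DISTRICT_PARTICIPANTS / DISTRICT_SAMPLE

print(f"Schools in a typical state sample: {state_schools}")
print(f"District sample as a fraction of the state sample: {sample_fraction:.2f}")
print(f"District standard error relative to the state: {se_ratio:.1f}")
print(f"Implied participation yield: {expected_yield:.0%}")
```

Under this simplifying assumption, a sample one-quarter the size gives standard errors twice as large, which matches the figure quoted above.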

Districts that desired to report mean proficiencies by background char- 
acteristics — such as race, ethnicity, type of courses taken, home-related vari- 
ables, instructional variables, and teacher variables — would need sample 
sizes approximately one-half of their corresponding state sample sizes, or 
approximately 1,500 students from a minimum of 50 schools. For reporting,
the “rule of 62” would apply, meaning that disaggregated results would
be provided only for groups with at least 62 students (National Assessment 
Governing Board, 1995b: Guideline 3). 
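
The “rule of 62” amounts to a minimum group size applied before any disaggregated result is reported. The Python sketch below shows one way such a filter might look; the function name, group labels, and counts are hypothetical and are not drawn from NAEP data or software.

```python
from typing import Dict

MIN_GROUP_SIZE = 62  # threshold stated in the reporting guideline


def reportable_groups(group_counts: Dict[str, int],
                      minimum: int = MIN_GROUP_SIZE) -> Dict[str, bool]:
    """Flag each subgroup as reportable if it has at least `minimum` students."""
    return {group: count >= minimum for group, count in group_counts.items()}


# Hypothetical subgroup counts for one district, grade, and subject.
example_counts = {"Group A": 410, "Group B": 62, "Group C": 61}
flags = reportable_groups(example_counts)

for group, ok in flags.items():
    status = "report" if ok else "suppress"
    print(f"{group}: n={example_counts[group]} -> {status}")
```

A group with exactly 62 students is reported, while a group with 61 is suppressed, which illustrates why a district sample of about 500 students may leave several subgroups unreportable.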

At the workshop, Richard Valliant, associate director of Westat’s Statistical
Group, further outlined the sampling requirements for districts.
Valliant described the “sparse state” option, which would require fewer schools
but would sample more students at the selected schools, and the “small
state” option, which would reduce the number of students tested per school.
Both options would still require 500 participating students. These sample 
sizes would allow for the reporting of scaled scores, achievement levels, and 
percentages of students at or above a given level for the entire district, but 
would probably not allow for stable estimates of performance for subgroups 
of the sample. 

Peggy Carr, associate commissioner in the Assessment Division at 
NCES, described two additional alternatives under consideration, the “en- 
hanced district sampling plan” and the “analytic approach.” The enhanced 
district sampling plan would reconfigure the state sampling design so that 
sufficient numbers of schools were sampled for interested districts. This 
plan might require oversampling at the district level and applying appro- 
priate weights to schools, and perhaps districts, during analysis. The ana- 
lytic approach, according to Carr, would allow districts to access existing 
data in order to identify districts like themselves and compare analytic
results. Carr noted that development of details about this option was still
under way.



Scoring Methodology

During the workshop, Nancy Allen, director of NAEP analysis and 
research at ETS, described the scoring methodology currently used for 
NAEP and explained how procedures would be adapted to generate dis- 
trict-level results. Allen reminded participants that ability estimates are not 
computed for individuals because the number of items to which any given 
student responds is insufficient to produce a reliable performance estimate. 
She described procedures used to generate the likely ability distributions 
for individuals, based on their background characteristics and responses to 
NAEP items (the conditioning procedures), and to randomly draw five 
ability estimates (plausible values) from these distributions. She noted that 
for state NAEP, the conditioning procedures utilize information on the 
characteristics of all test takers in the state. 
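
The general idea of plausible values can be illustrated with a toy example. The Python sketch below is not the ETS procedure. It assumes, purely for illustration, that each student's likely ability distribution is already summarized as a normal distribution with a known mean and standard deviation. In operational NAEP that distribution comes from combining item responses with the conditioning model. All names and numbers are hypothetical.

```python
import random
import statistics

N_PLAUSIBLE_VALUES = 5  # number of draws per student, as described in the text


def draw_plausible_values(post_mean, post_sd, rng, n=N_PLAUSIBLE_VALUES):
    """Draw n values from a student's (assumed normal) ability distribution."""
    return [rng.gauss(post_mean, post_sd) for _ in range(n)]


def group_mean_estimate(posteriors, rng):
    """Estimate a group mean from plausible values.

    One group mean is computed for each of the five sets of draws,
    and the five means are then averaged.
    """
    draws = [draw_plausible_values(m, s, rng) for m, s in posteriors]
    set_means = [
        statistics.mean(student[k] for student in draws)
        for k in range(N_PLAUSIBLE_VALUES)
    ]
    return statistics.mean(set_means), set_means


# Hypothetical (mean, sd) pairs for four students on an arbitrary scale.
students = [(250.0, 20.0), (230.0, 25.0), (270.0, 18.0), (245.0, 22.0)]
estimate, per_set = group_mean_estimate(students, random.Random(2000))
print(f"Group mean estimate: {estimate:.1f}")
print("Means by plausible-value set:", [round(m, 1) for m in per_set])
```

No single draw is treated as a student's score. The draws are useful only in aggregate, and the spread among the five set means gives a rough sense of the uncertainty that comes from each student answering only a small number of items.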

Participants and committee members raised questions about the infor- 
mation that would be included in the conditioning models for districts. 
For example, would the models be based on the characteristics of the state 
or the characteristics of the district? If models were based on the character- 
istics of the state, and the characteristics of the state differed from those of 
the district, would that affect the estimates of performance? Allen responded 
that the conditioning models rely on information about the relationships 
(covariation) between performance on test items and background charac- 
teristics. According to Allen, sometimes the compositional characteristics 
of the state and a district will differ with respect to background variables, 
but the relationships between cognitive performance and background char- 
acteristics may not differ. Nevertheless, Allen stressed that they were still 
exploring various models for calculating estimates at the district level, 
including some that condition on district characteristics. 

Given the potential bias in proficiency estimates that could result from 
a possibly erroneous conditioning model, the committee offers the follow- 
ing recommendation regarding conditioning procedures. 

RECOMMENDATION 3-1: If the decision is made to move
forward with providing district-level results, NAEP’s sponsors
should collect empirical evidence on the most appropriate procedures
for improving the accuracy of estimates of achievement
using demographic and background variables (conditioning
and plausible values technology). Conditioning is
most defensible when based on district-level background variables.
Empirical evidence should be gathered before selecting
an alternate procedure, supporting its acceptability.



Participation Decisions 

Roy Truby, executive director of NAGB, told participants that when 
Congress lifted the ban on below-state reporting, it neglected to include 
language in the law that clarified the roles of states and districts in making 
participation decisions. In 1998, when NCES offered results to the natu- 
rally occurring districts, the agency sent letters to both the districts and 
their respective states. Based on legal advice from the Department of 
Education’s Office of General Counsel, the agency determined that state
officials, not district officials, would make decisions about release of results. 
In at least one case, there appeared to be a conflict in which the state wanted 
the data released, but the district did not. NAGB members were concerned 
that the districts were not told when they agreed to participate in 1998 
NAEP that results for their districts might be released. Because of this 
ambiguity about decision-making procedures, NAGB passed the following 
resolution (National Assessment Governing Board, 1999d):

Since the policy on release of district-level results did not envision a disagree- 
ment between state and district officials, the Governing Board hereby sus- 
pends implementation of this policy, pending legislation which would pro- 
vide that the release of district-level NAEP results must be approved by both 
the district and state involved. 

The committee asked workshop participants to discuss their opinions 
about the entity (states or districts) that should have decision-making 
authority over participation and release of data. In general, district repre- 
sentatives believed that the participating entity should make participation 
decisions, while state representatives believed that the decision should lie 
with the state. Others thought that the entity that paid for participation 
should have decision-making authority. However, speakers stressed that 
the most pertinent issue was not about participation but about public 
release of results. Under the Freedom of Information Act, district results 
would be subject to public release once they were compiled. 

REACTIONS FROM WORKSHOP PARTICIPANTS

Workshop participants discussed technical and policy issues for dis- 
trict-level NAEP and made a number of observations. They are discussed 
next. 



Comparisons Among Similar Districts 

Like Selden (1991), some workshop participants found that district- 
level reporting would enable useful and important comparisons. Several 
state and district officials liked the idea of being able to make comparisons 
among similar districts. District officials reported that often others in the 
state do not understand the challenges they face, and comparisons with 
similar districts across state boundaries would enable them to evaluate their 
performance given their particular circumstances. For instance, some dis- 
tricts are confronting significant population growth that affects their avail- 
able resources. Others, such as large urban districts, have larger populations 
of groups that tend to perform less well on achievement tests. District 
officials believed that if performance could be compared among districts 
with similar characteristics, state officials might be more likely to set more 
reasonable and achievable expectations. Further, they noted that this prac- 
tice might allow them to identify districts performing better than expected, 
given their demographics, and attention could focus on determining in- 
structional practices that work well. 

A number of workshop participants were worried about the uses that 
might be made of district-level results. Some expressed concern that results 
would be used for accountability purposes and to chastise or reward school 
districts for their students’ performance. Using district-level results as part
of accountability programs would be especially problematic if the content 
and skills covered by NAEP were not aligned with local and state curricula. 
Officials from some of the larger urban areas also argued that they already 
know that their children do not perform as well as students in more afflu- 
ent suburban districts. Having another set of assessment results would 
provide yet another opportunity for the press and others to criticize them. 

Other state and district officials commented that states’ varied uses of 
assessments may confound comparisons. While districts may seem compa- 
rable based on their demographics, they may in fact be very different, be- 
cause of the context associated with state assessment programs. States dif- 
fer in the emphases they place on test results, the uses of the scores, and the 
amounts and kinds of attention results receive from the press. These factors
play a significant role in setting the stage for the testing and can make
comparisons misleading, even when districts appear similar because of their 
student populations. 



External Validation 

Some state and district officials were attracted to the prospect of hav- 
ing a means for external validation. They find NAEP to be a stable external 
measure of achievement against which they could compare their state and 
local assessment results. However, some also noted that attempts to obtain 
external validation for state assessments can create a double bind. When 
the findings from external measures corroborate state assessment results, no 
questions are asked. However, when state or local assessment results and 
external measures (such as state NAEP) differ, assessment directors are of- 
ten asked, “Which set of results is correct?” Explaining and accounting for 
these differences can be challenging. Having multiple indicators that sug- 
gest different findings can lead to public confusion about students’ achieve- 
ment. 

These challenges are particularly acute when a state or local assessment 
is similar, but not identical, to NAEP. For example, some state assessment 
programs have adopted the NAEP descriptors (advanced, proficient, and 
basic) for their achievement levels. However, their descriptions of performance
differ in important ways from the NAEP descriptions. NAEP’s
definition of “proficient,” for instance, may encompass different skills than
the state’s definition, creating problems for those who must explain and
interpret the two sets of test results. 

Some district and state officials expressed concern about the alignment 
between their curricula and the material tested on NAEP. Their state and
local assessments are part of an accountability system that includes instruc- 
tion, assessment, and evaluation. NAEP results would be less meaningful if 
they were based on content and skills not covered by their instructional 
programs. Using NAEP as a means of external validation for the
state assessment is problematic when the state assessment is aligned with
instruction and NAEP is not, particularly if results from the different as- 
sessments suggest different findings about students’ achievement. 

In addition, confusion arises when NAEP results are released at the 
same time as state or local assessment results. State and local results are 
timely, generally reporting data for a cohort while it is still in the particular
grade. For instance, when reports are published on the achievement of a
school system’s fourth graders, they represent the cohort currently in fourth
grade. When NAEP results are published, they are for some previous year’s
fourth graders. This again can lead to public confusion over students’ aca- 
demic accomplishments. 



Supplemental Assessments 

An appealing feature to state and district officials participating in the 
workshop was the possibility of having assessment results in subject areas 
and grades not tested by their state or local programs. Although state and 
local programs generally test students in reading and mathematics, not all 
provide assessments of all of the subject areas NAEP assesses, such as writ- 
ing, science, civics, and foreign languages. Some participants liked the idea 
of receiving results for twelfth graders, a grade not usually tested by state 
assessments. Also, NAEP collects background data that many states do not 
have the resources to collect. Some workshop participants have found the 
background data to be exceedingly useful and would look forward to receiv- 
ing reports that would associate district-level performance with background 
and environmental data. 



Lack of Program Details 

Workshop participants were bothered by the lack of specifications 
about district-level reporting. Even though the committee asked NAEP’s
sponsors to describe the plans and features of district-level reporting, many 
of the details have not yet been determined. In responding to questions 
put to them about district-level reporting, many participating state and 
district officials formulated their own assumptions and reacted to the pro- 
gram they thought might be enacted. For instance, as mentioned above, 
they assumed that assessments would be offered in the subject areas and 
grades available for national NAEP; however, district NAEP has currently 
only been associated with state NAEP. Hence, only reading, mathematics, 
writing, and science would be available and only in grades 4 and 8 (not 12). 
Those who looked forward to receiving data summarized by background
characteristics would likely be disappointed given the sample sizes required 
to obtain such information. 

Other state and district officials commented that their reactions to the 
propositions set forth by NAEP’s sponsors would depend upon the details.
Some of their questions included: How much would it cost to participate
in district-level NAEP? Who would pay for the costs? How would it be 
administered — centrally, as with national NAEP, or locally, as with state 
NAEP? What type of information would be included in the reports? How 
long would it take to receive results? Would district-level results require the 
same time lag for reporting as national and state NAEP? The answers to 
these questions would determine whether or not they would be interested 
in participating. 

Of concern to a number of participants, particularly to representatives 
from the Council of Chief State School Officers, was the issue of small 
districts. The sampling specifications described at the workshop indicated 
that districts would need at least 25 schools in a given grade level to receive 
reports. Technical experts present at the workshop wondered if sufficient 
thought had been given to the sample size specifications. If the district met 
the sample size requirements for students (i.e., at least 750 students), the 
number of schools should not matter. In state and national NAEP, there is 
considerable variation in average achievement levels across schools, and only 
a small percentage of schools are sampled and tested. A target of 100 
schools was set to be sure that the between-school variation was adequately 
captured. In district NAEP, there would be fewer schools and less variabil- 
ity between schools. In smaller districts, all schools might be included in 
the sample, thereby eliminating completely the portion of sampling error 
associated with between-school differences. Technical experts and others at 
the workshop encouraged NCES and Westat to pursue sampling specifica- 
tions and focus on the estimated overall accuracy of results rather than on 
specifying an arbitrary minimum number of schools based on current procedures
for state or national NAEP.
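
The technical experts' point about school counts can be illustrated with a small calculation. The sketch below is an illustration only, under stated assumptions: it uses the textbook approximation for the standard error of a mean under two-stage sampling with equal numbers of students per school, and every value in it (school counts, students per school, variance components) is hypothetical rather than taken from NCES or Westat specifications.

```python
import math


def district_mean_se(schools_sampled, schools_total, students_per_school,
                     between_school_var, within_school_var):
    """Approximate standard error of a district mean score.

    Textbook two-stage approximation with equal numbers of students
    sampled per school; the within-school finite population correction
    is ignored for simplicity.
    """
    if not 0 < schools_sampled <= schools_total:
        raise ValueError("schools_sampled must be between 1 and schools_total")
    # Finite population correction: equals 0 when every school is sampled.
    school_fpc = 1.0 - schools_sampled / schools_total
    between_part = school_fpc * between_school_var / schools_sampled
    within_part = within_school_var / (schools_sampled * students_per_school)
    return math.sqrt(between_part + within_part)


# Hypothetical variance components (assumed for illustration only).
HYPOTHETICAL_BETWEEN_VAR = 100.0   # variance of school means
HYPOTHETICAL_WITHIN_VAR = 900.0    # variance of students within schools

# Scenario A: small district, all 10 schools sampled, 75 students each (750 total).
se_all_schools = district_mean_se(10, 10, 75,
                                  HYPOTHETICAL_BETWEEN_VAR, HYPOTHETICAL_WITHIN_VAR)

# Scenario B: larger district, 25 of 100 schools sampled, 30 students each (750 total).
se_some_schools = district_mean_se(25, 100, 30,
                                   HYPOTHETICAL_BETWEEN_VAR, HYPOTHETICAL_WITHIN_VAR)

print(f"All 10 schools sampled:    SE = {se_all_schools:.2f}")
print(f"25 of 100 schools sampled: SE = {se_some_schools:.2f}")
```

Both scenarios test 750 students. In the first, the between-school term vanishes because every school is in the sample, so the district with fewer schools obtains the smaller standard error under these assumed values. This is why a rule stated in terms of overall accuracy may be more defensible than a fixed minimum number of schools.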

Others questioned how “district” might be defined and if district con- 
sortia would be allowed. Some participants were familiar with the First in 
the World consortium, formed by a group of districts in Illinois to partici- 
pate and receive results from the Third International Mathematics and Sci- 
ence Study. They wondered if such district consortia would be permitted 
for NAEP. 



SUGGESTIONS FOR NAEP’S SPONSORS 

The reporting system that is the subject of this chapter would create a 
new program with new NAEP products. One of the objectives for convening
the committee’s workshop on district-level reporting was to learn about
the factors that would affect states’ and districts’ interest in this new product.
After listening to workshop participants’ comments and reviewing the
available materials, the committee finds that many of the details regarding 
district-level reporting have not been thoroughly considered or laid out. 
District officials, state officials, and other NAEP users — the potential users 
of the new product — had a difficult time responding to questions about 
the product’s desirability because a clear conception of its characteristics 
was not available. The most important issues requiring resolution are 
described below. 



Clarify the Goals and Objectives 

The goals and objectives of district-level reporting were not apparent
from written materials or from information provided during the workshop. 
Some workshop participants spoke of using tests for accountability pur- 
poses, questioning whether NAEP could be used in this way or not. They 
discussed the amount of testing in their schools and stressed that new test- 
ing would need to be accompanied by new (and better) information. How- 
ever, some had difficulty identifying what new and better information 
might result from district-level NAEP data. Their comments might have 
been different, and perhaps more informative, if they had a clear idea of the
purposes and objectives for district-level reporting. An explicit statement is 
needed that specifies the goals and objectives for district-level reporting and 
presents a logical argument for how the program is expected to achieve the 
desired outcomes. 



Evaluate Costs and Benefits 

What would districts and states receive? When would they receive the 
information? How much would it cost? What benefits would be realized 
from the information? Workshop participants responded to questions about 
their interests in the program without having answers to these questions, 
though many said that their interest would depend on the answers. They 
need information on the types of reports to be prepared along with the 
associated costs. They need to know about the time lag for reporting. 
Would reports be received in time to use in their decision and policy mak- 
ing or would the time delays be such as to render the information useless? 

Costs and benefits must be considered in terms of teachers’ and students’
time and effort. State systems already extensively test fourth and
eighth graders. If time is to be taken away from instruction for the purpose
of additional testing, the benefits of the testing need to be laid out. Will 
additional testing amplify the information already provided? Or will the 
information be redundant to that provided from current tests? Will the 
redundancy make it useful for external validation? Such information needs 
to be provided in order for NAEP’s sponsors to assess actual levels of inter- 
est in the program. 



Evaluate Participation Levels 

During the workshop, many spoke of the value of being able to make 
interdistrict comparisons based on districts with like characteristics. However,
this use of the results assumes that sufficient numbers of districts will
participate. Previous experiences with district-level reporting resulted in a 
relatively low level of interest: between 10 and 12 interested districts in 
1996 and virtually none in 1998. 

Meaningful comparisons, as defined by demographic, political, and
other contextual variables of importance to districts, require a variety of
other districts with district-level reports. Having only a handful of districts
that meet the sampling criteria may limit one of the most fundamental
appeals of district-level reporting — that is, carefully selecting others with
which to compare results. Thus, if making comparisons is the primary 
objective for receiving district-level reports, the targeted districts must feel 
secure in knowing that there are sister districts also completing the neces- 
sary procedures for receiving district-level results. The extent of participa- 
tion will limit the ability to make the desired comparisons. 



Consider the Impact of Raising the Stakes 

A concern expressed when state NAEP was first implemented related 
to the potential for higher stakes to be associated with reporting data for 
smaller units. The message from several workshop speakers (particularly 
district representatives) was that district-level reports would raise the stakes 
associated with NAEP and change the way NAEP results are used. An 
evaluation should be conducted on the effects of higher stakes, particularly 
as they relate to the types of inferences that may be made.



CONCLUSIONS AND RECOMMENDATIONS

It was impossible for the committee to gauge actual interest in district- 
level reporting because too little information — such as program objectives, 
specifications, and costs — was available to potential users. When develop- 
ing a new product, it is common to seek reactions from potential users to 
identify design features that will make it more attractive. The reactions of 
potential users and the responses from product designers tend to produce a 
series of interactions like “Tell me what the new product is and I will tell 
you if I like it,” versus “Tell me what you would like the product to be and 
I will make sure it will have those characteristics.” During the committee’s 
workshop, state and district representatives were put in the position of re- 
sponding to the latter question. Here, the developer is asking the user to 
do some of the design work. Often the user is not knowledgeable
enough to give sound design recommendations. Instead, the product designer
needs to present concrete prototypes to get credible evaluative reactions.
The developer should also expect several iterations of prototype design
and evaluation before the design stabilizes at a compromise between
users’ needs and what is practically possible. This is the type of process 
required before ideas and products associated with district-level reporting 
can progress. 

RECOMMENDATION 3-2: Market research emphasizing
both needs analysis and product analysis is necessary to evaluate
the level of interest in district-level reporting. The decision
to move ahead with district-level reporting should be
based on the results of market research conducted by an independent
market-research organization. If market research suggests
that there is little or no interest in district-level reporting,
NAEP’s sponsors should not continue to invest NAEP’s
limited resources pursuing district-level reporting.

Market-basket reporting for NAEP has been proposed as a way to summarize
academic achievement on a representative collection of NAEP test
items. The objectives for market-basket reporting are twofold: to summa- 
rize performance in a way that is more comfortable to users; and to release 
the collection of items so that users would have a concrete reference for the 
meaning of the scores (National Assessment Governing Board, 1997). In 
addition, the market-basket approach would make it possible to track per- 
formance over time to document changes in students’ academic accom- 
plishments. The ultimate goal is to better communicate what students in 
the United States are expected to know and be able to do, according to the 
subject areas, content and skills, and grade levels assessed on NAEP. 

The earliest references to market-basket reporting of NAEP assessments 
appeared in the “Policy Statement on Redesigning the National Assessment 
of Educational Progress” (National Assessment Governing Board, 1996) 
and in the Design and Feasibility Team’s Report to NAGB (Forsyth et al.,
1996). These documents referred to market-basket reporting as “domain-score
reporting,” where a “goodly number of test questions are developed
that encompass the subject, and student results are reported as a percentage
of the ‘domain’ that students ‘know and can do’” (National Assessment
Governing Board, 1996:13). According to these documents, the general 
idea of a NAEP market basket draws on an image similar to the Consumer 
Price Index (CPI): a collection of test questions representative of some larger 
content domain; and an easily understood index to summarize performance
on the items. These writings generally refer to two components of the
NAEP market basket, the collection of items and the summary index. The 
documents consider collections of items that are large (e.g., too many items 
to be administered to a single student in its entirety) or small (e.g., small 
enough to be considered an administrable test form). They consider percent
correct scores as the metric for summarizing performance on the collection
of items, a metric NAEP’s sponsors believe is widely understood
(National Assessment Governing Board, 1997). Figure 1-1 (see Chapter 1) 
provides a pictorial description for the NAEP market basket and its various 
components. 

Perceptions about the configuration and uses for the NAEP market 
basket are not uniform. NAGB’s current policies address the short form 
version of a market basket, stating that its goal is to “enable faster, more 
understandable initial reporting of results” and to allow states access to test 
instruments to obtain NAEP results in years when particular subjects are 
not scheduled (National Assessment Governing Board, 1999a). Educators 
from both the state and local level who participated in the committee’s 
workshops envisioned NAEP market-basket forms as short forms that could 
be used as an alternative to or in connection with their local assessments,
possibly for the purpose of comparing local assessment results with NAEP
results (National Research Council, 2000). At the committee’s workshop
and in his writings on domain score reporting, Bock (1997) described the
market basket as a tool for sampling student knowledge over the entire 
domain of any given content area. Under Bock’s conception, the focus 
would extend beyond what is measured by NAEP and would support score 
inferences that provide information about how a student would perform 
on the larger domain. If one were to draw a direct parallel with the CPI,
an economic index that summarizes actual consumer purchases, one could
reasonably expect a market basket positioned as an educational index to
measure and report exactly what it is that students are learning.

The intent of this chapter is to explore various conceptions of market- 
basket reporting and discuss issues associated with NAEP’s implementation 
of such a reporting mechanism. We address the following study questions: 
(1) What is market-basket reporting? (2) What information needs might be 
served by market-basket reporting for NAEP? (3) Are market-basket re- 
ports likely to be relevant and accurate enough to meet these needs? This 
chapter deals more broadly with market-basket reporting, while the next 
chapter focuses specifically on the short form.

The first section of this chapter lays out the psychometric issues that
should be considered in connection with market-basket reporting. This is
followed by a description of the pilot study currently under way at ETS and
comments made by participants in the committee’s workshop. The final
section of the chapter presents details on the methodology for constructing 
and reporting results for the CPI market basket. 

STUDY APPROACH 

During the course of the study, the committee reviewed the literature 
and the policy guidelines pertaining to market-basket reporting, including 
the following documents: Design and Feasibility Team’s report to NAGB
(Forsyth et al., 1996); ETS’s proposal (Educational Testing Service, 1998);
various studies on domain score reporting (Bock, 1997; Bock, Thissen, & 
Zimowski, 1997; Pommerich & Nicewander, 1998); and policy guidelines 
included in the 1996 NAEP Redesign (National Assessment Governing 
Board, 1996) and in NAEP Design 2000-2010 policy (National Assess- 
ment Governing Board, 1999a). In addition, the committee listened to 
presentations by NAGB and NCES staff about market-basket reporting. 

As mentioned in Chapter 1, the committee held a workshop on 
market-basket reporting which provided a forum for discussions with 
representatives of the organizations involved in setting policy for and oper- 
ating NAEP (NAGB and NCES) along with individuals from ETS, the 
contractual agency that works on NAEP. In preparation for the workshop, 
NCES, NAGB, and ETS staff prepared the following papers: 

1. A Market Basket for NAEP: Policies and Objectives of the National
Assessment Governing Board by Roy Truby, executive director of 
NAGB 

2. Simplifying the Interpretation of NAEP Results With Market Baskets
and Shortened Forms of NAEP by Andrew Kolstad, senior technical 
advisor for the Assessment Division at NCES 

3. Evidentiary Relationships among Data-Gathering Methods and 
Reporting Scales In Surveys of Educational Achievement by Robert 
Mislevy, distinguished research scholar with ETS 

4. NAEP’s Year 2000 Market Basket Study: What Do We Expect to
Learn? by John Mazzeo, executive director of ETS’s School and
College Services

Individuals representing a variety of perspectives — education policy,
assessment, curriculum and instruction, measurement, and the press —
reacted to the ideas presented by NAEP’s sponsors and contractors. Because
the conception of the market basket has often been illustrated through 
analogies to the CPI market basket, we also arranged for a briefing on the 
CPI from a representative of the Bureau of Labor Statistics. Approximately 
40 individuals participated in the workshop, and the results were summa- 
rized and published (National Research Council, 2000). 

PSYCHOMETRIC CONSIDERATIONS FOR THE 
NAEP MARKET BASKET 

While the idea behind market-basket reporting is to produce more 
easily understood test results, the “behind-the-scenes” technology required 
to enable such reporting methodology is quite complex. During the work- 
shop, Robert Mislevy laid the conceptual groundwork for the technical and 
measurement issues involved in market-basket reporting (Mislevy, 2000); 
Andrew Kolstad traced the history of NAEP reporting practices (Kolstad, 
2000); and John Mazzeo described features of the pilot study currently
under way on the market basket. In the section that follows, we draw from 
the ideas presented by Mislevy, Kolstad, and Mazzeo and from other sources 
to delineate the psychometric issues that must be addressed in designing a 
NAEP market basket. 



The Market Basket Domain 

Perhaps the most critical issue for a market basket is determining the 
domain to be measured. For the current pilot study, the market basket 
domain is limited to the pool of existing or newly constructed NAEP items 
(Mazzeo, 2000). Such a domain might be selected as most desirable, but it 
is not the only way to define the market basket. Figure 4-1 depicts several 
key factors that must be considered. 

For any given content area, the first stage in developing instruction 
and assessment programs is delimitation of the targeted range of knowl- 
edge, skills, and objectives. In most cases, the range of material is too broad 
to be covered by a given instructional and assessment plan, forcing educa- 
tors to choose what they consider most important for students to know and 
learn. Under its broadest definition, the domain would include knowledge 
and skills: (1) deemed important by content experts; (2) covered by textbooks
and other instructional material; (3) specified by state and local curriculum
guides; (4) actually taught in the classroom; and (5) believed to be
critical by the larger public. NAEP’s use of matrix sampling allows it to
define the domain broadly, as is evident in the NAEP frameworks. The 
frameworks were selected through a broad-based consensus process to bal- 
ance current educational practice and reform recommendations (National 
Research Council, 1999b). 

Because the length of any assessment is constrained by time, the collec- 
tion of items a student takes can only be expected to be a sample from the 
domain. As with other tests, frameworks guide item and task development 
for NAEP so that performance on the test items can support inferences 
about the domain. The intent is to provide a reference for test construction 
that assures the final assessment will be representative of the defined 
domain. 

In constructing the market basket, the alignment of the item pool to
the framework and the framework’s representation of the broad
domain both have a substantial impact on the potential validity of inferences
based on market-basket scores. Given that the pilot study defines the 
domain as the pool of existing and newly constructed NAEP items (Mazzeo, 
2000), inferences from market-basket scores to the NAEP frameworks will 
rely on how well the item pool represents the frameworks. The Committee 
on the Evaluation of National and State Assessments of Educational 
Progress (National Research Council, 1999b:132) evaluated the fit of
NAEP items to the frameworks. They concluded: 

In general, the assessment item pools are reasonably reflective of the goals for 
distributions of items set forth in the framework matrices, particularly in the
content-area dimensions in mathematics and science. 

However, the presence of standards-based goals in the frameworks and the 
general fit of the assessment item pools to categories in the major framework 
dimensions do not ensure that the goals of the framework have been success- 
fully translated into assessment materials. Several lines of evidence indicate 
that NAEP’s assessments, as currently constructed and scored, do not adequately
assess some of the most valued aspects of the frameworks, particularly
with respect to assessing the more complex cognitive skills and levels and
types of students’ understanding. 

Thus, it is not clear that defining the domain as the pool of existing or 
newly developed NAEP items will result in a set of items for the market 
basket that adequately represents the frameworks.



The Basis for Market-Basket Reporting

To define the basis for market-basket reporting, two decisions must be 
made. The first relates to the set of items that sets the scale for the percent
correct or percent of maximum score; the second pertains to the method 
used to collect the data that are summarized using the market-basket ap- 
proach. The set of test items used to define the market basket could either 
be administered to students as part of data collection, or they could be 
selected from a calibrated set of items solely for the purpose of defining the 
score scale for reporting performance on the market basket. In the former 
case, the set of items could take the shape of an intact test form adminis- 
tered in its entirety to students. Given time constraints for test administra- 
tion, an administrable form would need to be relatively short, short enough 
to administer during a 40- or 50-minute testing session. We refer to this 
sort of a collection of items as a “short form.” In the latter case, the set of 
items could be assembled to represent the content and skill domain, but 
assembly of the collection would not be tied to administration of the items. 
That is, the items would be administered as part of NAEP but not necessar- 
ily in a form that would be used for reporting. This conception of the 
market basket — a collection of items that is never administered in its en- 
tirety as an intact test form — is called a “synthetic” form. Synthetic forms 
can be long or short. 

Synthetic forms can be developed to meet a variety of reporting goals. 
One alternative would be to use a very large pool of items that is too large 
to administer to any individual. This pool would yield better representa- 
tion of the NAEP framework, but it could provide more information than 
can easily be assimilated by the audiences for NAEP results. Alternatively, a 
synthetic form could be a smaller set of items that represents the NAEP 
framework in a more limited way and that could be used as a constant 
reference over time for tracking performance. Since the form would be 
short, it would not provide information about the nuances of the NAEP 
framework, only the major points. 

A third alternative could be a collection of synthetic short forms, which 
together would provide a more detailed representation of the NAEP frame- 
work than would a single short form. This would overcome the limitations 
in coverage of a single short form. Using multiple forms, however, intro- 
duces the complication of comparing results across test forms that are not 
identical.

Thus, the data that form the basis for market-basket reporting can be
collected using the large pool of items as is currently done with NAEP, or
via a short form or multiple short forms. The means for transforming the 
scores to the reporting metric will vary depending on the data collection 
method. 
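
One way a percent correct score on a synthetic form could be derived from calibrated items is sketched below. This is an illustration only and not the procedure used in the pilot study: it assumes a three-parameter logistic item response model, and the function names, item parameters, and proficiency values are all hypothetical.

```python
import math


def prob_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def expected_percent_correct(theta, items):
    """Expected percent correct on a form for a given proficiency theta."""
    if not items:
        raise ValueError("the form must contain at least one item")
    probs = [prob_correct(theta, a, b, c) for (a, b, c) in items]
    return 100.0 * sum(probs) / len(probs)


# Hypothetical calibrated items: (discrimination a, difficulty b, guessing c).
SYNTHETIC_FORM = [
    (1.0, -1.0, 0.20),
    (1.2, 0.0, 0.25),
    (0.8, 0.5, 0.00),
    (1.5, 1.0, 0.20),
]

# Translate several proficiency values into the percent correct metric.
for theta in (-1.0, 0.0, 1.0):
    score = expected_percent_correct(theta, SYNTHETIC_FORM)
    print(f"theta = {theta:+.1f} -> expected percent correct = {score:.1f}")
```

Because the calculation depends only on the item parameters, the items in the synthetic form never need to be administered together; a proficiency estimate obtained from any calibrated set of items can be expressed on the market-basket scale.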



Constructing Multiple Market-Basket Forms 

As stated above, market-basket reporting could be based on a single 
short form. Given the breadth of the NAEP frameworks, however, short 
forms necessarily will be limited in the way that they represent the NAEP 
frameworks. To overcome the limitations in coverage of a single short 
form, multiple market-basket forms can be constructed to be either techni- 
cally parallel (each measures similar content and skills) or arbitrary (each 
measures different sets of content and skills). Parallel test forms are com- 
monly used in large-scale achievement testing, while arbitrary forms are 
typical of NAEP. 

Arbitrary test forms measure the same general domain of content and 
skills, but they are not necessarily constructed to be comparable. They can 
be expected to have varying test length, use different item formats, and 
differentially sample the content domain. The test parameters from each 
form will also differ. 

Parallel forms, on the other hand, more consistently sample the domain
for a given group of examinees. However, the construction of parallel
test forms presents a developmental challenge. According to Stanley (1971:405):

The best guarantee of parallelism for two test forms would seem to be that a 
complete and detailed set of specifications for the test be prepared in advance 
of any final test construction. The set of specifications should indicate item
types, difficulty level of items, procedures and standards for item selection 
and refinement, and distribution of items with regard to the content to be 
covered, specified in as much detail as seems feasible. If each test form is then 
built to conform to the outline, while at the same time care is taken to avoid
identity or detailed overlapping of content, the two resulting test forms should 
be truly comparable. 

Depending on the degree to which parallelism is actually obtained, 
forms can be classified as classically parallel, tau-equivalent, or congeneric 
(Feldt & Brennan, 1989). The differences among these classifications pertain
to the distributions of true scores, error scores, and observed scores on
the forms.¹
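
For reference, these classifications are often stated in classical test theory notation. The symbols below are conventional notation assumed here for illustration, not notation taken from the sources cited above: X_j is the observed score on form j, T_j the true score, E_j the error score, and a and c are constants.

```latex
\[
X_j = T_j + E_j, \qquad j = 1, 2
\]
\begin{align*}
\text{Classically parallel:} &\quad T_1 = T_2, \quad \sigma^2_{E_1} = \sigma^2_{E_2} \\
\text{Tau-equivalent:}       &\quad T_1 = T_2, \quad \sigma^2_{E_1} \text{ and } \sigma^2_{E_2} \text{ may differ} \\
\text{Congeneric:}           &\quad T_2 = a\,T_1 + c \ (a > 0), \quad \sigma^2_{E_1} \text{ and } \sigma^2_{E_2} \text{ may differ}
\end{align*}
```

Under these conventional definitions, only classically parallel forms are guaranteed to yield observed scores with identical means and variances.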

To some extent, the type of forms used in data collection will directly 
affect the score distributions for the test. Knowing how scores are expected 
to be distributed serves as an indicator for selection of appropriate statistical
tools for estimating and reporting student performance. With rare exception,
the test forms used for NAEP have been arbitrary forms;
coupled with matrix sampling, their use has necessitated complex statistical 
techniques for estimating examinee performance (i.e., imputation and con- 
ditioning). If the desire is to make comparisons between market-basket 
results and main NAEP or to make predictions from one to the other, 
procedures for deriving scores for NAEP market baskets will be similarly 
complex. If there is no intention to compare results with main NAEP or to 
predict performance on one from the other, forms used to facilitate market- 
basket reporting need not follow the path of NAEP and can be constructed 
to yield more easily derivable and interpretable information. 

Statistical Methods for Linking Scores from Multiple Test Forms 

If multiple test forms are used, student performance across the forms 
will likely differ. Even if the forms were constructed in a manner intended 
to yield parallel forms (i.e., similar in content, format, difficulty, and 
length), differences in difficulties will be expected. Equating procedures 
can be used to adjust for differences in difficulty levels (though not to align 
content or make up for test length differences) and will yield scores that can 
be used interchangeably across forms (Kolen & Brennan, 1995). Percent 
correct scores based on different forms can, thus, be equated, and adjusted 
percent correct scores reported. 
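
As a rough illustration of the adjustment described above, the following sketch applies linear equating, one of several equating methods, to percent correct scores from two forms. It assumes the two forms were taken by randomly equivalent groups; the function name and the score values are hypothetical and are not drawn from NAEP data.

```python
from statistics import mean, pstdev

def linear_equate(score_x, scores_x, scores_y):
    """Place a Form X score on the Form Y scale by matching the two
    forms' means and standard deviations (linear equating)."""
    mu_x, mu_y = mean(scores_x), mean(scores_y)
    sd_x, sd_y = pstdev(scores_x), pstdev(scores_y)
    return mu_y + (sd_y / sd_x) * (score_x - mu_x)

# Hypothetical percent correct scores from two randomly equivalent groups.
form_x = [40.0, 50.0, 60.0, 70.0, 80.0]  # the harder form
form_y = [50.0, 60.0, 70.0, 80.0, 90.0]  # the easier form

# A 60 on the harder form is expressed on the easier form's scale.
adjusted = linear_equate(60.0, form_x, form_y)
print(round(adjusted, 1))
```

Linear equating adjusts only for differences in difficulty and spread; as noted above, it cannot align content or make up for differences in test length.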



1 Classically parallel forms must, theoretically, yield score distributions with identical
means and variances for both observed scores and true scores. Classically parallel forms share 
a common metric. Tau-equivalent forms have the same mix of items but may differ slightly
with regard to the numbers of items. Tau-equivalent measures can yield different error vari- 
ances and observed score variances. True scores as well as their variances are constant across 
tau-equivalent forms as long as forms do not vary in length in any meaningful way. 
Congeneric forms include the same essential mix of knowledge and skills but may differ in 
terms of the number and difficulty of the items. The observed score distributions from 
congeneric forms may have different characteristics that may in part result from variations in 
test length. 
Since arbitrary forms can consist of a different mix of item types and
can vary in test length and difficulty levels, scores based on arbitrary forms 
must be linked using calibration techniques, rather than equating proce- 
dures. Further, the precision with which scores on arbitrary forms are esti- 
mated can vary both across forms and across student proficiency levels 
within a form. Given these differences among forms, Item Response 
Theory (IRT) models are most often used for linking scores from different 
forms. 



Comparisons between Market-Basket Scores and NAEP Performance 

Market-basket reporting requires some method for placing NAEP 
results on the market-basket score scale. This can be accomplished directly
by administering one or more market-basket short forms to a statistically 
representative sample of the NAEP examinee population. This approach 
will not work for the long-form market basket, however, because the num- 
ber of items is too great to administer to an individual student. 

An alternative approach is to project NAEP results from a separate 
data collection onto the score scale defined by a market-basket form. The 
form can be either an administrable short form or one of a variety of syn- 
thetic forms. The methodology used for projection is statistically intensive 
because of complexities in the dimensional structure of some NAEP frame- 
works (e.g., the multiple scales) as well as the IRT and plausible values 
methodologies used for the analysis. 



Score Metrics 

There are several score metrics that can be considered for market-basket
reporting, each of which poses challenges in terms of providing NAEP's
audiences with a more easily understood summary of performance. The
proposed score metrics are: (1) observed scores, (2) estimated observed and/or
true scores, (3) estimated domain referenced scores, and (4) estimated
latent trait proficiency scores.



Observed Scores 

The observed score metric is based on a tally of the number of right 
answers or the number of points received. The most direct method for 
obtaining observed scores is to administer one or more short forms to an 
appropriate sample of students. The observed score that has most frequently
been suggested for market-basket reporting is the percent correct
score. Observed scores can be quickly converted to a percent correct or 
percent of maximum score by adding the number correct on the multiple- 
choice items and the points received on the constructed response items and 
then dividing the sum by the total number of possible points. Observed 
scores have the problem of being tied to the composition and difficulty of 
the collection of items on the test form. Under a configuration in which 
multiple forms were used, a method (equating or calibration) would be 
needed to adjust scores for these form differences so that the scores would 
have the same interpretation. 
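
The percent of maximum score computation described above can be sketched as follows. The function name, the item counts, and the point values are hypothetical and serve only to illustrate the arithmetic.

```python
def percent_of_maximum(mc_correct, cr_points, mc_items, cr_max_points):
    """Percent of maximum score: points earned divided by points possible.

    mc_correct    -- number of multiple-choice items answered correctly
    cr_points     -- points earned on each constructed response item
    mc_items      -- number of multiple-choice items (one point each)
    cr_max_points -- top score available on each constructed response item
    """
    earned = mc_correct + sum(cr_points)
    possible = mc_items + sum(cr_max_points)
    return 100.0 * earned / possible

# Hypothetical student: 20 of 25 multiple-choice items correct, plus
# 2, 1, and 4 points on constructed response items worth 3, 3, and 6.
score = percent_of_maximum(20, [2, 1, 4], 25, [3, 3, 6])
print(round(score, 1))
```

The result depends on which items happen to be on the form, which is why scores from different forms would still need to be equated or calibrated before being compared.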

At first blush, percent correct scores seem to be a simple, straightfor- 
ward, and intuitively appealing way to increase public understanding of 
NAEP results. However, they present complexities of their own. First, 
NAEP contains a mix of multiple-choice and constructed response items. 
Multiple-choice items are awarded one point if answered correctly and zero 
points if answered incorrectly. Answers to constructed response items are 
awarded a varying number of points. For some constructed response ques- 
tions, 6 is the top score; for others, 3 is the top score. For a given task, more 
points are awarded to answers that demonstrate greater proficiency. There- 
fore, in order to come up with a simple sum of the number of correct 
responses to test items that include constructed response items, one would 
need to understand the judgment behind “correct answers.” What would it 
mean to get a “correct answer” on a constructed response item? Receiving 
all points? Half of the points? Any score above zero? 

As an alternative, the percent correct score might be based not on the 
number of questions but on the total number of points. This presents 
another complexity, however. Adding the number of points would result in 
awarding more weight to the constructed response questions than the mul- 
tiple-choice questions. For example, suppose a constructed response ques- 
tion could receive between 1 and 6 points, with a 2 representing slightly 
more competence in the area than a 1 but clearly not enough competence 
to get a 6. Compare a score of 2 out of 6 possible points on this item versus 
a multiple-choice item where the top score for a correct answer is 1. A 
simple sum would give twice as much weight to the barely correct constructed
response item as to a correct multiple-choice item. This might
be reasonable if the constructed response questions required a level of skill 
higher than the multiple-choice questions, such that a score of 2 on the 
former actually represented twice as much skill as a score of 1 on the latter, 
but this is not the case for NAEP questions. Hence, some type of weighting
scheme is needed. Yet, that weighting also would introduce complexity
to the percent correct metric. 
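
The weighting problem can be made concrete with the two-item example above. The sketch below compares a simple sum of points with one possible alternative in which every item is rescaled to a maximum of one unit. The equal-item weighting shown here is an assumption chosen for illustration, not a scheme proposed for NAEP.

```python
def percent_by_points(items):
    """Simple sum: total points earned over total points possible."""
    earned = sum(e for e, _ in items)
    possible = sum(m for _, m in items)
    return 100.0 * earned / possible

def percent_by_items(items):
    """Equal-item weighting: each item is rescaled to the 0-1 range
    before averaging, so no item counts for more than any other."""
    return 100.0 * sum(e / m for e, m in items) / len(items)

# (points earned, maximum points): a constructed response item scored
# 2 of 6 and a multiple-choice item answered correctly (1 of 1).
items = [(2, 6), (1, 1)]

by_points = percent_by_points(items)  # the weak response contributes 2 of 3 earned points
by_items = percent_by_items(items)    # the weak response contributes 1/3 of a unit
print(round(by_points, 1), round(by_items, 1))
```

The two schemes give noticeably different summaries of the same responses, which illustrates why a weighting decision, and an explanation of it, would have to accompany any percent correct report.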



Estimated True Score 

Reporting on a true score metric involves making a prediction from 
the observed score to the expected true score (it is a predicted score, since an 
individual's true score is never known). For a NAEP short form, the prediction
would be based on the sample of administered items. A similar
prediction would be made for the estimated observed score based on a 
longer form of which a given student takes only a portion of items. Esti- 
mated true scores could be derived from techniques aligned with either 
classical test theory or IRT. Reporting on an estimated true score or estimated
observed score metric means working with predictive distributions
of these scores, which requires statistical procedures that are more complex
than those for reporting observed number correct or percent correct scores. 
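
One classical test theory technique for predicting a true score from an observed score is Kelley's regressed estimate, which pulls the observed score toward the group mean in proportion to the unreliability of the test. The sketch below is a minimal illustration of that single formula; the reliability, group mean, and observed score are hypothetical, and the procedures needed for NAEP would be considerably more involved.

```python
def kelley_true_score(observed, group_mean, reliability):
    """Kelley's estimate: group_mean + reliability * (observed - group_mean)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return group_mean + reliability * (observed - group_mean)

# Hypothetical values: an observed percent correct of 80, a group mean
# of 60, and a form reliability of 0.75.
estimate = kelley_true_score(80.0, 60.0, 0.75)
print(estimate)
```

With perfect reliability the estimate equals the observed score; with zero reliability it collapses to the group mean.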



Estimated Domain Score 

As defined by Bock (1997), the estimated domain referenced score 
involves expressing scale scores in terms of the expected percent correct on 
a larger collection of items that are representative of the specified domain. 
The expected percent correct can be calculated for any given scale score 
using IRT methods (see Bock et al., 1997). This calculation would involve 
transforming observed scores, based on an assessment of part of the do- 
main, to an expected percent correct score. While derivation of this score 
would require complex procedures, it would result in scores on the metric 
(e.g., percent correct) that NAEP's sponsors consider more intuitively appealing
than an IRT proficiency score (Kolstad, 2000).
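
The conversion from a scale score to an expected percent correct can be sketched under simplifying assumptions. The example below assumes a two-parameter logistic IRT model and dichotomously scored items only; the item parameters are hypothetical, and the function names are illustrative rather than taken from any NAEP procedure.

```python
import math

def p_correct(theta, a, b):
    """Two-parameter logistic model: probability of a correct response
    for proficiency theta, discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def expected_percent_correct(theta, item_params):
    """Average the response probabilities over every item in the domain."""
    probs = [p_correct(theta, a, b) for a, b in item_params]
    return 100.0 * sum(probs) / len(probs)

# Hypothetical (a, b) parameters for a small "domain" of three items.
domain_items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]

epc = expected_percent_correct(0.0, domain_items)
print(round(epc, 1))
```

Because the calculation runs over the whole item collection, a student need only take part of the domain for a proficiency estimate to be expressed as a percent correct on all of it.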



Estimated Proficiency Score 

IRT-based procedures for estimating proficiency yield estimates re- 
ferred to as “latent trait estimates.” Use of the latent trait metric requires 
estimation of the latent trait distribution. NAEP currently estimates latent
trait distributions that are converted to scaled score distributions for re- 
porting. Estimating the latent trait distribution also involves complicated 
transformations from observed scores but has the advantage that, when 
IRT assumptions are met, the distributions generalize beyond the specific
set of administered items. Market-basket reports could use the latent trait 
(theta) metric, or latent trait scores could be converted to scaled scores, but 
reporting on this metric would not ameliorate the interpretation problems 
associated with the current NAEP reporting scale. 

THE YEAR 2000 PILOT STUDY ON 
MARKET BASKET REPORTING 

The market-basket pilot study, currently under way at ETS, was de- 
signed with three goals in mind: (1) to produce and evaluate a market- 
basket report of NAEP results; (2) to gain experience with constructing 
short forms; and (3) to conduct research on the methodological and tech- 
nical issues associated with implementing a market-basket reporting system 
(Mazzeo, 2000). The study involves the construction of two fourth-grade 
mathematics test forms, also referred to as administrable or short forms. 
Under one configuration for market-basket reporting, one of these forms 
would be released as the market basket set of exemplar items, and the other 
would be treated as a secure form for states and districts to administer as 
they see fit. The pilot study also investigates preparation and release of the 
longer version of the market basket. ETS researchers plan to simulate a 
longer synthetic form of the market basket by combining the two short 
forms. Because no student will have taken both short forms, scores for the 
long form will be derived from performance on the items and the relation- 
ships across the forms. 

The test developers hope that the study will serve as a learning experi- 
ence regarding the construction of alternate NAEP short forms, since short 
forms might be used by NAEP even without the move to market-basket 
reporting. Whereas creating intact test forms is a standard part of most 
testing programs, this is not the case with NAEP. NAEP’s current system 
for developing and field testing items was set up to support the construc- 
tion of a system of arbitrary test forms in an efficient manner and does not 
yet have guidelines for constructing market baskets or intact tests. 

A NAEP test development committee handled construction of the 
short forms. They were instructed to identify a set of secure NAEP items 
that were high quality exemplars of the pool and to select items that 
matched the pool with respect to content, process, format, and statistical 
specifications. The committee constructed two forms that could be ad- 
ministered within a 45-minute time period, one consisting of 31 items, the 
other containing 33 items. The items were organized into three distinct
blocks, each given during separately timed 15-minute test sessions. One of
the short forms consisted of previously administered secure items; the other 
consisted of new items. Both forms were given to a random sample of 
8,000 students during the NAEP 2000 administration. The forms were 
spiraled 2 with previously administered NAEP materials to enable linking
to NAEP.
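
Spiraling hands out one copy of each form in turn and then repeats the cycle, so that students receive forms in an essentially random way and each form reaches approximately the same number of students. A minimal sketch of that distribution rule follows; the student identifiers and form labels are hypothetical.

```python
from collections import Counter
from itertools import cycle

def spiral_assign(student_ids, form_ids):
    """Give each student the next form in the cycle, returning to the
    first form after the last one has been handed out."""
    return {student: form for student, form in zip(student_ids, cycle(form_ids))}

# Hypothetical roster of ten students and three booklets.
students = [f"S{i:02d}" for i in range(1, 11)]
forms = ["short_A", "short_B", "main_naep_booklet"]

assignment = spiral_assign(students, forms)
counts = Counter(assignment.values())
print(counts)
```

Form counts can differ by at most one, which is what allows groups taking different forms to be treated as approximately equivalent when the forms are linked.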

The study’s sponsors expect the research to yield three products: (1) 
one or more secure short forms; (2) a research report intended for technical 
audiences that examines test development and data analytic issues associ- 
ated with the implementation of market-basket reporting; and (3) a report 
intended for general audiences. 

ETS researchers will continue to study alternative analysis and data 
collection methods. One of their planned studies involves conducting sepa- 
rate analyses of the year-2000 data using methods appropriate for arbitrary 
forms, methods appropriate for congeneric forms, and methods appropri- 
ate for parallel forms. Each of these sets of analyses will produce results in 
an observed score metric as well as a true score metric. Comparisons of 
results from the other approaches to the results from the arbitrary forms 
will provide concrete information about which data gathering options are 
most viable for the market-basket concept. These comparisons will evaluate 
the degree of similarity among the sets of results based on the stronger 
models, which use congeneric or parallel forms and involve less complex 
analytic procedures, and results from the arbitrary forms, which make the 
weakest assumptions but involve the most complicated analyses. If the 
results are similar, the simpler data collection and analytic procedures may 
be acceptable. In addition, comparing observed score and true score results 
for each of the approaches will inform decisions about which type of re- 
porting scale should be used. 

The year-2000 study will also evaluate the potential benefit of using 
longer market baskets. The 31-item short forms were chosen to minimize
school and student burden and to increase the chances of obtaining school
participation in NAEP. Other decisions regarding test length could also be



2 Spiraling is an approach to form distribution in which one copy of each different form
is handed out before spiraling down to a second copy of each form and then a third and so 
forth. The goals of this approach are to achieve essentially random assignment of students to 
forms while ensuring that approximately equal numbers of students complete each form. 
made, such as the domain score reporting approach (Bock, 1997). (See
Chapter 5 for a description of this approach.) Clearly, a longer collection of 
items would permit more adequate domain coverage and produce more 
reliable results. 
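
The relationship between test length and reliability can be illustrated with the Spearman-Brown formula from classical test theory. The sketch assumes that the added items are parallel to those already on the form, and the starting reliability of 0.80 is hypothetical rather than a value reported for the pilot forms.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by length_factor
    using items parallel to those already on the form."""
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)

# Hypothetical short-form reliability of 0.80, with the form doubled in
# length (for example, by combining two short forms into one long form).
doubled = spearman_brown(0.80, 2.0)
print(round(doubled, 3))
```

The gain from added length diminishes as reliability approaches 1, so the benefit of a longer market basket has to be weighed against the added school and student burden.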

WORKSHOP PARTICIPANTS’ REACTIONS TO PLANS FOR 
MARKET-BASKET REPORTING 

Large-Scale Release of NAEP Items 

Participants in the committee's workshop on market-basket reporting
suggested several ways for the market-basket set of items to be used. Test 
directors and school system administrators found the idea of releasing a 
representative set of items to be very appealing and maintained this would 
help to “demystify” NAEP. In their interactions with the public, school 
officials have found that many of their constituents often question the 
amount of time devoted to testing and are unsure of how to interpret the 
results. They believe that the public is not fully aware of the range of 
material on achievement tests, the skills that students are expected to dem- 
onstrate, and the inferences that test results can support. Furthermore, the 
public does not always see the link between assessment programs and school 
reform efforts. Helping the public better understand what is being tested 
and the rationale for testing could do much to garner public support for 
continuing to gather this information. 

The release of NAEP items could also fulfill a second purpose. Even 
though the market basket set of items would be representative of NAEP, 
some state testing programs cover content similar to that assessed by NAEP. 
Therefore, NAEP's release of items could increase understanding of state
and local assessments. 

Curriculum specialists and school administrators observed that the re- 
lease of a large number of items could stimulate discussion among teachers 
regarding the format and content of questions. Review of the items could 
facilitate discussions about how local curricula (particularly content cover- 
age and the sequencing of course material) compare with the material cov- 
ered on NAEP. Workshop speakers explained that it is often difficult to 
draw conclusions about their states’ NAEP performance because it is not 
clear whether the material tested on NAEP is covered by their curricula or 
at which grade level it is covered. 

State and local assessment directors suggested that a large-scale release 
of NAEP items and related test materials could improve state and local
assessment programs. Many judge NAEP items to be of high quality. 
Allowing test developers to view large amounts of NAEP test materials 
could have a positive effect on the quality of item design for state and local 
assessments. Similarly, review of items by teachers could serve to improve 
classroom-based assessments. 

While participants generally saw value in a large-scale release of items, 
some were concerned about the uses made of the items. Assessment direc- 
tors and curriculum specialists worried that a large release might unduly 
influence local and state curricula or assessments. For instance, policy mak- 
ers and educators concerned about their NAEP performance could attempt 
to align their curricula more closely with what is tested on NAEP. Because 
assessment, curricula, and instructional practices form a tightly woven sys- 
tem, making changes to one aspect of the system can have an impact on 
other aspects. Attempts to align curricula more closely to NAEP could 
upset the entire instructional program. 



Percent Correct Scores 

Nearly all speakers were skeptical about using percent correct scores to 
report performance and were doubtful that it would accomplish its intended 
purpose. Assessment directors and measurement experts commented that 
percent correct scores were not as simple as they might seem. For instance, 
would percent correct be based on the number of correct answers or the 
number of possible points? Furthermore, how could a percent correct score 
be compared to the main NAEP scale, given that main NAEP results are 
not reported on this metric? Several assessment directors commented that 
they had devoted considerable time to helping users understand achievement- 
level reporting and felt that their constituencies had become familiar with 
this reporting mechanism. Percent correct scores would require new inter- 
pretative assistance. In addition, while percent correct scores would be 
associated with the achievement levels (i.e., the percent correct at basic, 
proficient, and advanced), these percentages may not conform to public 
beliefs about performance at a given level. The percent correct required to 
be considered proficient, for instance, could turn out to be lower than the 
public would expect. Such a discrepancy could damage the public's opinion
of NAEP. 
CONSTRUCTION OF THE CONSUMER PRICE INDEX
MARKET BASKET 

During the course of this study, the materials NAEP's sponsors provided
to the committee described general ideas for the NAEP market basket
but did not present firm proposals for its design. The committee believed 
it would be instructive to learn about summary indicators used in other 
fields. Because the NAEP market basket has been linked with the CPI 
from its inception, the committee thought it would be useful to learn more 
about how the CPI was constructed and how it might be applied in an 
educational setting. During the committee's workshop on market-basket
reporting, Kenneth Stewart from the Bureau of Labor Statistics (BLS) 
described the processes and methods used for deriving and utilizing the 
CPI. Stewart's remarks are summarized below; additional details about the
CPI appear in Appendix B. 

Background and Current Uses of the CPI 

The CPI is a measure of the average change over time in the prices paid 
by urban consumers in the United States for a fixed basket of goods in a 
fixed geographic area. The CPI was developed during World War I so that 
the federal government could establish cost-of-living adjustments for work- 
ers in shipbuilding centers. Today, the CPI is the principal source of infor- 
mation concerning trends in consumer prices and inflation in the United 
States. It is widely used as an economic indicator and a means of adjusting 
other economic series (e.g., retail sales, hourly earnings) and dollar values 
used in government programs, such as payments to Social Security recipients
and to Federal and military retirees. The BLS currently produces two
national indices every month: the CPI for All Urban Consumers and the 
more narrowly based CPI for Urban Wage Earners and Clerical Workers, 
which is developed using data from households represented in only certain 
occupations. In addition to the national indexes, the BLS produces indexes
for geographic regions and collective urban areas. Compositions of the 
regional market baskets generally vary substantially across areas because of 
differences in purchasing patterns. Thus, these indexes cannot be used for 
relative comparisons of the level of prices or the cost of living in different
geographic areas. 
Collection of Data on Consumer Expenditures

The BLS develops the CPI market basket on the basis of detailed infor- 
mation provided by families and individuals about their actual purchases. 
Information on purchases is gathered from households in the Consumer 
Expenditure Survey, which consists of two components: an interview sur- 
vey and a diary survey. 3 Each component has its own questionnaire and 
sample. 

In the quarterly interview portion of the Consumer Expenditure sur- 
vey, an interviewer visits every consumer in the sample every 3 months over 
a 12-month period. The Consumer Expenditure interview survey is de- 
signed to collect data on the types of expenditures that respondents can be 
expected to recall for a period of 3 months or longer. These expenditures 
include major purchases, such as property, automobiles, and major appli- 
ances, and expenses that occur on a regular basis, such as rent, insurance 
premiums, and utilities. Expenditures incurred on trips are also reported in 
this survey. The Consumer Expenditure interview survey thus collects de- 
tailed data on 60 to 70 percent of total household expenditures. Global 
estimates — i.e., expense patterns for a 3-month period — are obtained for 
food and other selected items, accounting for an additional 20 percent to 
25 percent of total household expenditures. 

In the diary component of the Consumer Expenditure survey, con- 
sumers are asked to maintain a complete record of expenses for two con- 
secutive one-week periods. The Consumer Expenditure diary survey was 
designed to obtain detailed data on frequently purchased small items, in- 
cluding food and beverages (both at home and in eating places), tobacco, 
housekeeping supplies, nonprescription drugs, and personal care products 
and services. Respondents are less likely to recall such items over long 
periods. Integrating data from the interview and diary surveys thus pro- 
vides a complete accounting of expenditures and income. 

Both the interview and diary surveys collect data on household charac- 
teristics and income. Data on household characteristics are used to deter- 
mine the eligibility of the family for inclusion in the population covered by 
the Consumer Price Index, to classify families for purposes of analysis, and 
to adjust for nonresponse by families who do not complete the survey. 



3 Much of the material in this section is excerpted from Appendix B of Consumer Expenditure
Survey, 1996-97, Report 935, Bureau of Labor Statistics, September 1999.
Household demographic characteristics are also used to integrate data from
the interview and diary components. 

Construction of the CPI Market-Basket System 

The BLS prices the CPI market basket and produces the monthly CPI 
index using a complex, multistage sampling process. The first stage in- 
volves the selection of urban areas that will constitute the CPI geographic 
sample. Because the CPI market basket is constructed using data from the 
Consumer Expenditure survey, the geographic areas selected for the CPI 
for All Urban Areas are also used in the Consumer Expenditure survey. 
Once selected, the CPI geographic sample is fixed for 10 years until new 
census data become available. Using the information supplied by families 
in the Consumer Expenditure surveys, the BLS constructs the CPI market 
basket by partitioning the set of all consumer goods and services into a 
hierarchy of increasingly detailed categories, referred to as the CPI item 
structure. Each item category is assigned an expenditure weight, or impor- 
tance, based on its share of total family expenditures. One can ultimately 
view the CPI market basket as a set of item categories and associated expen- 
diture weights. 
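
The view of the market basket as a set of item categories with expenditure weights can be sketched as follows. The categories, dollar amounts, and prices are hypothetical, and the index shown is a simple fixed-basket (Laspeyres-type) calculation offered as an assumption for illustration; the procedures BLS actually uses involve multistage sampling and are far more elaborate.

```python
def expenditure_weights(spending):
    """Each category's weight is its share of total expenditures."""
    total = sum(spending.values())
    return {category: amount / total for category, amount in spending.items()}

def fixed_basket_index(weights, base_prices, current_prices):
    """Weighted average of price relatives, with the base period set to 100."""
    return 100.0 * sum(
        weight * current_prices[category] / base_prices[category]
        for category, weight in weights.items()
    )

# Hypothetical base-period spending for a small set of categories.
spending = {"housing": 400.0, "food": 150.0, "transportation": 170.0, "other": 280.0}
weights = expenditure_weights(spending)

# Hypothetical prices: housing rises 10 percent, food rises 4 percent.
base = {"housing": 100.0, "food": 100.0, "transportation": 100.0, "other": 100.0}
current = {"housing": 110.0, "food": 104.0, "transportation": 100.0, "other": 100.0}

index = fixed_basket_index(weights, base, current)
print(round(index, 1))
```

Because the weights come from surveys of what households actually buy, the index is descriptive of current consumption, a property that matters for the comparison with NAEP later in the chapter.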



Updating and Improving the CPI Market Basket 

Because of the many important uses of the monthly CPI, there is great 
interest in ensuring that the CPI market basket accurately reflects changes 
in consumption over time. Each decade, data from the U.S. census of 
population and housing are used to update the CPI process in three key 
respects: (1) redesigning the national geographic sample to reflect shifts in 
population; (2) revising the CPI item structure to represent current con- 
sumption patterns; and (3) modifying the expenditure weights to reflect 
changes in the item structure as well as reallocation of the family budget. 

CONCLUSIONS AND RECOMMENDATIONS 

It is apparent from the discussion in this chapter that all decisions 
about the configuration and features of the NAEP market basket involve 
tradeoffs. Some methods for configuring the market basket would result in 
simpler procedures than others but would not support the desired infer- 
ences. Other methods would yield more generalizable results but at the 
expense of simplicity. The simplest methods would use parallel short forms
for data collection and observed scores for reporting, but this configuration 
may not yield forms and scores generalizable to the larger content domain. 
The most generalizable results would be based on a system of arbitrary 
forms with performance reported as the estimated proficiency score (i.e., 
latent trait estimates), as is currently done with NAEP. However, this is 
also one of the most complex configurations. 

If NAEP's sponsors decide to proceed with designing a market basket,
decision making about its configuration should be based on a clear articu- 
lation of the purposes and objectives for the market basket. The needs the 
market basket will serve and the intended uses should guide decisions about 
its features. 

RECOMMENDATION 4-1: All decisions about the configuration
of the NAEP market basket will involve tradeoffs. Some
methods for configuring the market basket would result in 
simpler procedures than others but would not support the de- 
sired inferences. Other methods would yield more generaliz- 
able results but at the expense of simplicity. If the decision is 
made to proceed with designing a NAEP market basket, its 
configuration should be based on a clear articulation of the 
purposes and objectives for the market basket. 

CPI Market Basket Versus A NAEP Index: Parallels and Contrasts 

The task of building an educational parallel to the CPI is formidable 
and appears to differ conceptually from the current NAEP market-basket 
development activities. It is unknown how well the final market-basket 
instrument, in whatever format, will serve its major goal of better inform- 
ing the American public regarding the educational accomplishments of its 
students. The eventual attainment of this goal must begin with a definition 
of educational accomplishments along with serious consideration of the 
psychometric properties of the instruments that must be in place to sup- 
port the desired score inferences. 

In considering the proposals to develop and report a summary measure 
from the existing NAEP frameworks, the committee realized that the proposals
for the NAEP market basket differ fundamentally from the purpose and
construction of the CPI market basket. Although the NAEP frameworks
are developed by committees of experts familiar with school-level curricula,
they are not descriptive; that is, they are not based on surveys of what
schools actually teach. 

Implementing a market-basket approach for NAEP, analogous to that 
used by the CPI, would thus necessitate major operational changes. The 
design that would most directly parallel that of the CPI would call for 
surveying classrooms to determine the content and skills currently being 
taught to students. This is analogous to surveying households to find out 
what consumers are buying. In the CPI context, the household surveys 
create a market basket of goods. In the NAEP context, the surveys would 
lead to a “market basket” of instructional content that would need to reflect 
regional differences in what is taught. Test forms would be constructed to 
represent this instructional content and administered to evaluate students’ 
mastery of the material. The resulting scores would indicate how much 
students know about the currently taught subject matter. Hence, if the 
NAEP market basket were constructed to parallel the CPI market basket, it 
would include items representing what survey data show is currently taught 
in classrooms. 



CONCLUSION 4-1: Use of the term “market basket” is misleading,
because (1) the NAEP frameworks reflect the aspirations
of policy makers and educators and are not purely descriptive
in nature and (2) the current operational features of
NAEP differ fundamentally from the data collection processes 
used in producing the CPI. 

RECOMMENDATION 4-2: In describing the various propos- 
als for reporting a summary measure from the existing NAEP 
frameworks, NAEP’s sponsors should refrain from using the 
term “market basket” because of inaccuracies in the implied 
analogy with the CPI. 



RECOMMENDATION 4-3: If, given the issues raised about 
market-basket reporting, NAEP’s sponsors wish to pursue the 
development of this concept, they should consider developing 
an educational index that possesses characteristics analogous 
to those of the Consumer Price Index: (1) is descriptive rather 
than reflecting policy makers’ and educators’ aspirations; (2) is 
reflective of regional differences in educational programs; and 
(3) is updated regularly to incorporate changes in curriculum. 
Changed NAEP:

Use of a Short-Form Version of NAEP 



As currently configured, NAEP employs a matrix sampling method for 
administering items to students (see Chapter 2) but does not include the 
option for administering a fixed test form to large numbers of individuals 
(Allen, Carlson, & Zelenak, 1998). Implementing such an option will 
require major changes to the way NAEP test forms are constructed and 
NAEP results are reported. Such changes are certainly within the realm of 
possibilities, however. NAGB has active working groups (National Assess- 
ment Governing Board, 1999c; National Assessment Governing Board, 
2000a) looking into alternate delivery and reporting models for NAEP, and 
the short form and market-basket concepts originated from the activities of 
those groups. 

This chapter deals explicitly with the short form and addresses the 
questions: (1) What role might a short form play in providing market- 
basket results; and (2) How might the short form be used? The chapter 
begins with a discussion of NAGB’s policy and plans for the short form, 
which is followed by a description of the ways states and districts might use 
the short forms based on comments from participants in the committee’s 
workshop on market-basket reporting. The chapter continues with a review 
of the pilot study of short forms and ends with discussion of ways to 
construct short forms. 

STUDY APPROACH 

During the course of the study, we reviewed policy statements address- 
ing the short form (National Assessment Governing Board, 1996; National 
Assessment Governing Board, 1999a; National Assessment Governing 
Board, 1999b; Forsyth et al., 1996) and information on the ETS year 2000 
pilot study on market-basket reporting (Mazzeo, 2000). We asked Patricia 
Kenney, co-director of the National Council of Teachers of Mathematics 
(NCTM) NAEP Interpretive Reports Project, to review the two short forms 
developed for the pilot study. We also focused specifically on the short form 
during our workshop on market-basket reporting and asked participants to 
discuss their interest in and potential uses for the short form of NAEP (see 
Chapter 4 for additional details on the workshop). 

WHAT ARE NAEP SHORT FORMS AND 
HOW MIGHT THEY BE USED? 

Policy for the NAEP Short Form 

In the most recent redesign policy, the short form is cited as a mecha- 
nism for simplifying NAEP design, specifically (National Assessment Gov- 
erning Board, 1999a:7): 

Plans for the short-form of the National Assessment, using a single test book- 
let, are being implemented. The purpose of the short-form test is to enable 
faster, more understandable initial reporting of results and, possibly, for states 
to have access to test instruments allowing them to obtain NAEP assessment 
results in years in which NAEP assessments are not scheduled in particular 
subjects. 

To guide policy and decision making on the measurement issues per- 
taining to the short forms, NAGB adopted the following principles 
(National Assessment Governing Board, 1999b): 

Principle 1: The NAEP short form shall not violate the Congressional prohi- 
bition to produce, report, or maintain individual examinee scores. 

Principle 2: The Board shall decide which grades and subjects shall be as- 
sessed using a short form. 

Principle 3: Development costs, including item development, field testing, 
scoring, scaling, and linking shall be borne by the NAEP program. The costs 
associated with use, including administration, scoring, analysis, and report- 
ing shall be borne by the user. 

Principle 4: NAEP short forms intended for actual administration should 
represent the content of corresponding NAEP assessment frameworks as fully 
as possible. Any departure from this principle must be approved by the Board. 

Principle 5: Since it is desirable to report the results of the short form using 
the achievement levels, the content achievement level descriptions should be 
considered during the development of the short form. 

Principle 6: All versions of the short form should be linked to the extent 
possible using technically sound statistical procedures. 



The National Assessment Governing Board’s Vision and 
Uses for the Short Form 

At the committee’s workshop on market-basket reporting, Roy Truby, 
executive director of NAGB, explained the concept of the NAEP short 
form, describing it as a short, administrable test representative of the con- 
tent domain tested on NAEP (Truby, 2000). Results on the short form 
could be summarized using a percent correct metric. The short form could 
provide additional data collection opportunities that are not part of the 
standard NAEP schedule, such as testing in off years or in other subjects 
not assessed at the state level. Truby described how some people envision 
using a short form: 

If short forms were developed and kept secure, they could provide flexibility 
to states and any jurisdiction below the state level that were interested in 
using NAEP for surveying student achievement in subjects, grades, and times 
that were not part of the regular state-NAEP schedule. Once developed, 
such market-basket forms should be faster and less expensive to administer, 
score, and report than the standard NAEP, and could provide score distribu- 
tions without the complex statistical methods on which NAEP now relies. 

This might help states and others link their own assessments to NAEP, which 
is another important objective of the Board’s redesign policy. 

Truby noted that the details associated with these components of the 
market-basket concept have not yet been thoroughly investigated. Based 
on the pilot study findings (see Mazzeo, 2000), NAGB might pursue simi- 
lar studies in other content areas and grades. 



Workshop Participants’ Visions and Uses for the Short Form 

Some school administrators and directors of assessment were attracted 
to the concept of the short form as a means for obtaining benchmarking 
data. They envisioned the short form as a test that could be administered 
to an entire cohort of students (e.g., all fourth grade students in a school or 
in a district); short form results could be quickly derived and aggregated to 
the appropriate levels (i.e., school or district level). Under this vision for the 
short form, summaries of short-form results could be compared to those 
for national NAEP to provide schools and districts with information on 
how their students’ achievement compared with national results. Partici- 
pants believed that this information would be uniquely useful in assessing 
students’ strengths and weaknesses and in setting goals for improving stu- 
dent achievement. 

Some school administrators and assessment directors also envisioned 
the short form as a set of questions that could be embedded into current 
assessments as a mechanism for “linking” results from current assessments 
to NAEP. Under this vision, the set of questions could be administered in 
conjunction with other state or local assessments. Short form results could 
be used to enable comparisons between state and local assessment and main 
NAEP. It is important to point out that the issues associated with establish- 
ing linkages between NAEP and state and local assessments were previously 
addressed by two other NRC committees (National Research Council, 
1999a; National Research Council, 1999d), who cited numerous problems 
with such practices. 

Curriculum specialists saw the short form as a way to gather additional 
information about what is tested on NAEP and how it compares to their 
instructional programs. The released short form could permit educators 
and policy makers to have first-hand access to the material included on the 
test. Their review of the released material could promote discussions about 
what is tested and how it compares with the skills and material covered by 
their own curriculum. The secure short form would yield data that could 
further these discussions. Educators could examine student data and evalu- 
ate performance in relation to their local practices. They could engage in 
discussions about their curricula, instructional practices, and sequencing of 
instructional material, and could contemplate changes that might be 
needed. 

Participants also liked the idea of having a NAEP test to administer in 
“off-years” from regular NAEP administrations. Because NAEP does not 
currently administer every subject to every grade every year, workshop par- 
ticipants believed the short form could help fill the “gaps.” The short form 
could be given every year thereby enabling the compilation of yearly trend 
data. These uses for the short forms are discussed in greater detail in the 
workshop summary (National Research Council, 2000). 

Workshop Participants’ Concerns About the Short Forms 

Some workshop speakers challenged the premises behind the various 
uses for the short form. Several questioned how comparable scores on the 
short forms would be to results from NAEP. As described in Chapter 2, 
NAEP uses complex procedures for deriving score estimates (including the 
conditioning and plausible values methodologies). If short form results 
were provided quickly and without the complex statistical methods, results 
from the short form would not be conditioned; hence short form results 
would not be comparable to the regular NAEP-scale results. 

Comparisons between short form results and state or local assessment 
results also might not yield the type of information desired. State and local 
assessments are part of an overall program in which curricula, instruction, 
and assessments are aligned. However, alignment may not extend to the 
NAEP frameworks, and the short form might test areas not covered by the 
curriculum. While it might be enlightening to compare NAEP’s coverage 
with local curricula, testing students on material they have not been taught 
presents problems for interpreting the results. 

Student motivation would also factor into performance on the short 
form. State and local assessments tend to be higher stakes exams that carry 
consequences. At present, NAEP is not a high-stakes test. Administration 
of the short form as part of a high-stakes assessment would change the 
context in ways that could affect the comparability between results on the 
short form and the regular NAEP results. 

The prohibition against individual results was also cited as problem- 
atic. The short form could be administered in a manner closely resembling 
other testing — testing that results in individual score reports. Although 
individual results would be generated initially from the short form, they 
would need to be aggregated for reporting purposes. Participants felt that 
this prohibition would be difficult to explain. These concerns about the 
short form are discussed in detail in the workshop summary (National Re- 
search Council, 2000). 



Review of the Pilot Short Forms 

As explained in Chapter 4, ETS prepared two fourth grade mathemat- 
ics short forms as part of the year 2000 pilot study. One of the two pilot 
short forms contains 31 items and the other 33. These items were intended 
to represent NAEP’s existing fourth grade mathematics item pools. During 
the workshop, Bock (2000) estimated that the reliability of the short form 
would be likely to fall in the low .80 range. While this might be considered 
acceptable, the more pertinent concern for the short form is not reliability 
but generalizability. That is, would performance on the short form support 
inferences about performance on the larger domain of mathematics? For 
the workshop, the committee asked Patricia Kenney to consider the 
feasibility of creating short forms for fourth grade mathematics and the 
extent to which the developed short forms were representative of what 
NAEP tests. 
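
As a rough illustration of how test length bears on reliability, the Spearman-Brown formula from classical test theory projects the reliability of a test whose length is multiplied by a factor k, under the assumption that the items added or removed are parallel to the remaining ones. This sketch is not a reconstruction of Bock's estimate, whose method is not described here; the function name and the example values (a 31-item form with reliability 0.82, shortened to 15 items) are assumptions chosen only for illustration.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability when test length is multiplied by length_factor.

    Assumes the items added or removed are parallel to the existing items.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0.0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical values: a 31-item form with reliability 0.82, cut to 15 items.
projected = spearman_brown(0.82, 15 / 31)
print(f"Projected reliability of the shorter form: {projected:.2f}")
```

Under these assumed numbers the projected reliability drops noticeably, which shows why the choice between a form of roughly 30 items and one of 10 to 15 items is not only a matter of content coverage.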

Kenney reported that the short forms appeared to represent the general 
content strands and the item types in the frameworks. However, she ques- 
tioned whether the forms covered the full range of cognitive processes the 
framework describes, as well as all of the 56 topics and subtopics covered by 
the NAEP frameworks. Kenney questioned the extent to which approxi- 
mately 30 items would be able to adequately represent the frameworks at 
the topic or subtopic level (Kenney, 2000). Additionally, NAEP items can 
be administered at more than one grade level. Because NAEP results are 
not reported at the student level, there is no disadvantage for assessing 
students on topics that they may not have studied. The problem with these 
“grade overlap” items, however, is that they might become misinterpreted 
as NAEP grade-appropriate expectations. Considering the uses cited above 
for the short forms, Kenney was concerned about how these grade overlap 
items would be regarded (Kenney, 2000). 

THE DESIRED CHARACTERISTICS OF A SHORT FORM 

Given the alternative visions and uses described above, we can now 
consider options for constructing and implementing short form NAEP. 
NAGB policy (National Assessment Governing Board, 1999b) states that 
“NAEP short forms intended for actual administration should represent 
the content of corresponding NAEP assessment frameworks as fully as 
possible” (Principle 4). This statement implies that NAGB’s intent is to 
produce short forms that are samples of the domain represented by the 
framework. While it does not seem to be the intent to represent the current 
NAEP item pool, or to create scales, the short form needs to be capable of 
providing estimates of the true score distribution that is the target for full 
NAEP. That distribution is needed to support policy Principle 5, “Since it 
is desirable to report the results of the short form using the achievement 
levels, the content achievement level descriptions should be considered 
during the development of the short form.” Reporting results according to the 
achievement levels requires an accurate estimation of the proficiency distri- 
bution. Estimation of the distribution requires the specification of a scale. 

NAGB policy does not provide any further guidance about the desired 
technical characteristics for the short form. While the generality of policy 
statements is appropriate so that developers are not limited in the ap- 
proaches they consider for putting policy into practice, the lack of detail 
allows a variety of interpretations. For example, the state and district test 
directors imagine a short form of 10 to 15 items that can be embedded in 
their tests as anchors to link their tests to the NAEP scale (O’Reilly, 2000). 
The ETS-produced pilot versions contain 31 and 33 items (Mazzeo, 
2000) — twice the length imagined by the test directors. Since the ETS 
pilot short form was limited by other constraints, additional conceptions 
would also be feasible. 

The NAGB materials (National Assessment Governing Board, 1999b) 
and the discussions at the workshop (National Research Council, 2000) 
imply the following specifications for the short form. 

1. The short form should represent the NAEP framework. 

2. The short form should be at least somewhat consistent with the 
achievement level descriptions. 

3. It should be possible to aggregate the data from the short forms to 
provide good estimates of mean performance for subgroups of the 
student population. 

4. It should be possible to estimate the proportion above the achieve- 
ment level cutscores. This implies that the short form can support 
estimation of the distribution of scores on the NAEP scale. 

5. It should be possible to compare results from alternate forms of 
short forms for a curriculum area, which implies that the short 
forms are to be put on a common score scale, perhaps through an 
equating process. 

6. Some would like to use the short form as an anchor test for con- 
necting other testing programs to the NAEP reporting scale. This 
use is not addressed by the policy for the short form. 

These specifications present a challenging development task because 
the short form will necessarily have different psychometric characteristics 
than the full set of current NAEP items or any one of the NAEP booklets. 
Successful accomplishment of this development task depends on the degree 
to which each requirement must be met. For example, if the level of 
accuracy of a mean estimate from the short form does not have to be as great as 
that for the full NAEP, then requirement 3 can probably be met. However, 
if the level of accuracy of the mean estimates must be the same as for full 
NAEP, then the design of the short form and its administration plan will be 
very challenging. To assist NAEP’s sponsors with these difficult issues, we 
consider them next. 



MEETING THE DESIRED SPECIFICATIONS 
FOR THE SHORT FORMS 

Representing the Frameworks 

The first requirement is that the short form should represent the NAEP 
framework. However, “represent” is open to multiple interpretations. One 
interpretation is a formal statistical sampling from a population. If every 
item in the domain had an equal chance of being sampled, the resulting 
sample would represent the entire population. The short form could then 
either represent the domain (i.e., the framework) or the current NAEP 
pool. These are not synonymous because the current NAEP pool may not 
“represent” the NAEP framework in any statistical sense; that is, the items 
in the NAEP pool are not a random sample from the domain. 

A short form could be constructed to “represent” the current NAEP 
pool in a statistical sense by randomly sampling items from that pool. Such 
a sample might not include items from every content specification category, 
but it would be an unbiased statistical sample and would therefore 
represent the larger number of items. 

A more general interpretation of “represent” might be that the short 
form provides “examples” of the types of tasks required by NAEP. Under 
this interpretation, the NAEP item pool would be considered to represent 
the framework, and any set of items that assesses the skills listed in the 
framework would represent the framework by example. Because the frame- 
work is very broad, it would be impossible to present sample items for 
every type of skill and knowledge in the framework. Thus for practical 
reasons, the short form’s representation of the framework must be incomplete, 
and the short form would represent the framework less well than the 
full NAEP pool represents the framework. 

An even looser interpretation of “represent” could be that the items on 
a short form provide selected examples of the kinds of items developed 
from the framework. The items in the NAEP pool could be sorted according 
to the skills required by the achievement level descriptions to help meet 
requirement 2. While these sortings would not be perfectly reliable, they 
could support a loose definition of representation. Either a stratified sam- 
pling of 30 items could be drawn from that pool, or a carefully reasoned 
sample could be selected to produce a descriptive example of the pool. If 
the current NAEP items cannot be used, new items could be produced that 
measure skills consistent with the frameworks document. All of these op- 
tions meet a loose definition of “represent.” 



Approaches to Constructing Short Forms 

The Standards for Educational and Psychological Testing (American Educational 
Research Association, American Psychological Association, & 
National Council on Measurement in Education, 1999) present guidelines 
to be followed in constructing a short form version of a longer test. Specifi- 
cally, Standard 3.16 states: 

If a short form of a test is prepared, for example, by reducing the number of 
items on the original test or organizing portions of a test into a separate form, 
the specifications of the short form should be as similar as possible to those of 
the original test. The procedure for reduction of items should be docu- 
mented. 

Given these guidelines, we describe two procedures that could be used to 
develop short forms. 



Domain Sampling Approach 

Given that the goal of NAEP is to assess the knowledge and skills of 
fourth, eighth, and twelfth grade students in the areas defined by the frame- 
works, the measurement model that seems most appropriate to this task is 
domain sampling (Nunnally, 1967:175). If a domain sampling approach 
were to be used, the NAEP framework would define the domain, and the 
goal of test development would be to produce an instrument that contains 
tasks that are an appropriate sample from that domain. Ideally, the frame- 
work would be translated into specifications that clearly delimit the types 
of items included in the domain. 

With this approach, developers would produce many items that 
represented the domain, and forms would be developed by sampling from the 
set of items. For the purposes of the present discussion, the full set of all 
NAEP items included in all forms given to students during an operational 
NAEP administration will be considered the long form. It cannot be con- 
sidered a long form in the usual sense, because no student would take all of 
the items. However, the “long form” would define the score scale for 
reporting NAEP results. The short form would simply be a test containing 
fewer items than the long form. 

Under a domain sampling approach, a short form of NAEP could be 
developed by selecting a smaller sample of items than for the long form. 
This process for creating a short form would address Standard 3.16 because 
the specifications for the domain are the same for both the short and long 
form. If formal statistical sampling procedures were used, both the long 
form and the short form would represent the full domain but to different 
degrees of accuracy. 

The NAEP item and form development process has not been as formal 
as the domain sampling model. A large pool of items has been produced to 
match the content and cognitive skills described in each framework docu- 
ment, but the items that have been produced were not intended to be a 
statistically representative sample from the domain (Allen, Carlson, & 
Zelenak, 1998). The framework documents do not define clear boundaries 
for the domain (Forsyth, 1991), and no criteria are given for determining 
whether or not an item is a part of the domain. At best, the items in a set of 
NAEP booklets for a content area can be considered to be a sample from 
the domain, but a sample with unknown statistical properties. 

Hence, construction of a short form becomes more challenging than 
merely taking a statistical sample from a well-defined pool of items. Be- 
cause the NAEP forms are an idiosyncratic sample from the domain, the 
best approach from a domain sampling perspective is to select a sample of 
items from the current set of items. The resulting sample would be repre- 
sentative of the items on a current set of NAEP forms, but would not 
necessarily be representative of the full domain. A stratified random 
sampling plan could be used to make sure that important content strands 
are proportionally represented. 
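
A stratified random sampling plan of this kind can be sketched as follows. This is a minimal illustration, not a description of any procedure ETS or NAGB has used: the strand labels, pool sizes, item identifiers, and the function name are all hypothetical, and proportional allocation with largest-remainder rounding is only one of several reasonable allocation rules.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(items, form_length, seed=0):
    """Draw a proportional stratified random sample of item ids.

    items: list of (item_id, strand) pairs from the current item pool.
    form_length: number of items wanted on the short form.
    """
    if form_length > len(items):
        raise ValueError("form_length exceeds pool size")
    rng = random.Random(seed)
    by_strand = defaultdict(list)
    for item_id, strand in items:
        by_strand[strand].append(item_id)

    # Proportional allocation; largest-remainder rounding keeps the total exact.
    total = len(items)
    quotas = {s: form_length * len(ids) / total for s, ids in by_strand.items()}
    counts = {s: int(q) for s, q in quotas.items()}
    leftover = form_length - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - counts[s], reverse=True)
    for s in by_remainder[:leftover]:
        counts[s] += 1

    form = []
    for s, ids in by_strand.items():
        form.extend(rng.sample(ids, counts[s]))  # simple random draw within strand
    return form


# Hypothetical pool of 150 items spread over five content strands.
pool_sizes = {"strand_A": 60, "strand_B": 30, "strand_C": 25, "strand_D": 20, "strand_E": 15}
pool = [(f"{strand}_{i:03d}", strand) for strand, n in pool_sizes.items() for i in range(n)]
form = stratified_sample(pool, form_length=30)
strand_of = dict(pool)
print(Counter(strand_of[item_id] for item_id in form))
```

With these assumed pool sizes, each strand contributes items to the 30-item form in proportion to its share of the pool. Note that this guarantees representation only of the current pool, not of the underlying domain.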



Scale Construction Approach 

An alternative procedure might be based on the trait estimation ap- 
proach commonly used in psychology (McDonald, 1999), which defines a 
hypothetical construct and then selects test items estimated to be highly 
correlated with the construct. While the resulting set of items defines a 
scale for the construct, there is no intention to define a domain of content 
or to sample from the domain. The test development process is considered 
effective as long as the set of items rank orders individuals on the scale for 
the hypothetical construct. 

Employing this approach with NAEP would imply that NAEP’s purpose 
is to place students along one or more continua based on their 
responses to the test items. The items would be selected to define scales 
rather than to represent the domain. To be consistent with the require- 
ments of Standard 3.16, the short form would have to define the same 
scales as the full NAEP. 



Precision of Measurement 

Either approach to developing a short form would result in a test with 
different measurement properties than a “long” form. For instance, scores 
from the short form will have less precision of measurement than a test 
consisting of the full set of current NAEP items. The comment to Standard 
3.16 addresses the differences in measurement properties and calls for their 
documentation, saying: 

The extent to which the specifications of the short form differ from those of 
the original test, and the implications of such differences for interpreting the 
scores derived from the short form, should be documented. 

One clear difference between the short form and the long form is that 
scores from the short form will have a different reliability and standard 
error structure¹ than those from the full NAEP pool even though the short 
form and full NAEP provide samples from the same domain of content 
(National Research Council, 2000). 

If the domain sampling approach is used, the short form will result in 
greater sampling error than full NAEP because a smaller sample is taken 
from the content domain. Although both sets of items (test forms) would 
represent the domain, and both would measure the same constructs, the 
smaller sample would have larger estimation error. 



¹ Standard error structure refers to the pattern of conditional standard errors of 
measurement at different points on the reporting score scale. Because of the different lengths of 
the two forms, the conditional standard errors will certainly not be the same at every point 
on the score scale. 

Under the scale construction approach, the content framework 
determines the number of scales that need to be considered. For example, NAEP 
Mathematics reports a composite score formed by weighting five scales 
(National Assessment Governing Board, 2000). When the test is shortened, 
the number of scales would remain the same, but fewer items would be 
used to define the scales. Because the scale of measurement for the short 
form would be defined with less fine gradations than defined by the full set 
of items, scores would be estimated with less precision of measurement. 
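
The idea of a weighted composite can be sketched in a few lines. The subscale names, scores, and weights below are placeholders invented for illustration and are not NAEP's actual scales or weights; the function name is likewise hypothetical.

```python
def composite_score(subscale_scores, weights):
    """Weighted average of subscale scores; weights need not sum to 1."""
    if set(subscale_scores) != set(weights):
        raise ValueError("scores and weights must cover the same subscales")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * subscale_scores[name] for name in weights)
    return weighted_sum / total_weight


# Placeholder values for five hypothetical subscales.
weights = {"scale_1": 0.40, "scale_2": 0.20, "scale_3": 0.15, "scale_4": 0.10, "scale_5": 0.15}
scores = {"scale_1": 220.0, "scale_2": 225.0, "scale_3": 230.0, "scale_4": 215.0, "scale_5": 210.0}
composite = composite_score(scores, weights)
print(f"Composite: {composite:.1f}")
```

On a short form, each of the five subscale scores would rest on only a handful of items, so the error in every term of this sum would grow even though the weights stay the same.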

Discussions of the relative standard error of measurement for the short 
form and the full NAEP must be carefully considered. In the matrix sam- 
pling design used by NAEP, the standard error of measurement for a stu- 
dent is large for long form NAEP — possibly larger than the standard error 
of measurement for a hypothetical short form. However, estimates of popu- 
lation parameters, such as the population mean and standard deviation, are 
based on the full set of items and the full sample of students, and they use 
collateral background information to “condition” the estimation process 
(see Chapter 2). Consequently, the estimation of population parameters 
should be much more precise for full NAEP than for a short form even 
though the short form might yield smaller measurement error for a student’s 
score if individual scores were permitted to be generated for NAEP. 



Technical Requirements for a Short Form 

The technical requirements for a short form are very challenging. Re- 
quirements 3 and 4 suggest that the short form allow estimation of means 
and percentages of distributions on the NAEP scale. This implies that the 
short form would produce scores on the same composite of skills as the full 
NAEP pool. This is also required by Standard 3.16. Producing a short 
form that will result in scores that fulfill the statistical requirements will 
require careful matching of content and statistical characteristics of the 
items on the short form to the NAEP item pool. This can best be done 
using multidimensional procedures to select items that create the desired 
composite score and a score distribution that is similar to that from the full 
NAEP sample. In theory, this could be accomplished using the full set of 
tools available from IRT and computerized test assembly methodologies. 
Even with those tools, however, the test assembly process will be difficult, 
and it will be necessary to confirm that the desired composite of abilities is 
assessed. 

CONCLUSIONS AND RECOMMENDATIONS 

The committee’s review of the materials on the short form concept 
indicates that NAGB and potential consumers of short form results have 
varying conceptions of the short form. Some (McConachie, 2000; O’Reilly, 
2000) believe the short form should function as an anchor test that can be 
used to link various types of assessments to NAEP so the results can be 
reported on the NAEP score scale. Others (Mazzeo, 2000; Truby, 2000) 
view the short form as a mechanism for implementing market-basket re- 
porting or as a way of facilitating district-level reporting and providing 
more responsive reporting of NAEP results (O’Reilly, 2000; Truby, 2000). 
These differing views about the short form make it difficult for the com- 
mittee to make specific recommendations because so many details have yet 
to be decided. Nevertheless, the conception of many workshop participants 
that the short form could be used as an anchor to put state assessment 
results on the NAEP scale is not likely to be tenable. The difficulties associ- 
ated with attempts to achieve such links among assessments have been docu- 
mented in previous reports by other NRC committees (National Research 
Council, 1999a; National Research Council, 1999d). 

CONCLUSION 5-1: Thus far, the NAEP short form has been 
defined by general NAGB policy, but it has not been devel- 
oped in sufficient technical and practical detail that potential 
users can react to a firm proposal. Instead, users are project- 
ing into the general idea their own desired characteristics for 
the short form, such as an anchor for linking scales. Some of 
their ideas and desires for the short form have already been 
determined to be problematic. It will not be possible for a 
short form design to support all uses described by workshop 
participants. 

The most positive result that can be expected from attempts at short 
form construction is that the short form is shown to measure the same 
composite of skills and knowledge as the full NAEP pool and that the 
distribution of statistical item characteristics is such that the shape of the 
estimated score distribution will be similar, though not identical, to that for 
current NAEP. The distribution will probably not be exactly the same 
because of differences in the error distribution that result from using a 
shorter test. The practical result is that the mean scores estimated from the 
short form will probably have larger standard errors than those from the 
full NAEP and that the estimates of proportions above the achievement 
level cutscores will also contain more error. The results from the short form 
will probably look different than those from full NAEP, even if exactly the 
same students took both types of tests. The differences in error will add 
“noise” to the results of the two types of tests in different ways. 
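
One way to see why results could differ even for identical students is a classical test theory sketch. If true scores are normally distributed and measurement error is independent of true score, observed scores keep the same mean but have a larger variance, so the share of observed scores above a fixed cutscore shifts. Every number below (mean, standard deviation, cutscore, reliabilities) is hypothetical, and the normality assumption is a simplification rather than a description of NAEP's estimation procedures.

```python
from statistics import NormalDist


def proportion_above(cut, mean, true_sd, reliability):
    """Share of observed scores above `cut`.

    Reliability is taken as true-score variance divided by observed-score
    variance, so the observed standard deviation is true_sd / sqrt(reliability).
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    observed_sd = true_sd / reliability ** 0.5
    return 1.0 - NormalDist(mean, observed_sd).cdf(cut)


# Hypothetical standardized scale with a cutscore one SD above the mean.
p_true = proportion_above(cut=1.0, mean=0.0, true_sd=1.0, reliability=1.0)
p_short = proportion_above(cut=1.0, mean=0.0, true_sd=1.0, reliability=0.8)
print(f"Error-free scores above cut: {p_true:.3f}")
print(f"Reliability-0.8 scores above cut: {p_short:.3f}")
```

In this simplified setting, added measurement error inflates the proportion above a cutscore that lies above the mean and deflates it for a cutscore below the mean, while leaving the estimated mean unchanged.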

Comparisons of short form and full NAEP results will not be easy, 
even for technically sophisticated consumers. The fact that the two sets of 
results are not directly comparable does not mean that the short form would 
not be useful. It does mean, however, that the differences in interpretation 
must be made clear to avoid confusion. One way would be to use different 
score scales and to report short form scores as estimates of the proportion of 
the full NAEP pool that students would get correct rather than scores on 
the NAEP score scale. In this case, the error in estimates could be indicated 
with error bars or other reporting methods. Use of different score scales 
would preclude making direct comparisons, but the short form may still 
have value as a more frequent monitor of student capabilities. However, it 
is worth restating here that, to many workshop participants, being able to 
make comparisons with main NAEP was one of the more appealing features 
of the short form. 
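
A percent-correct estimate with an error bar, as suggested above, might be computed along the following lines. This is a minimal sketch that assumes a simple random sample of students and a normal approximation; NAEP's complex sample design would require different variance estimation. The function name and the input values are hypothetical.

```python
import math
import statistics


def percent_correct_with_error_bar(proportions, z=1.96):
    """Return (mean, lower, upper) in percent-correct units.

    proportions: per-student proportions of items answered correctly.
    Uses a simple-random-sample standard error, which is an assumption
    of this sketch and not NAEP's actual procedure.
    """
    n = len(proportions)
    if n < 2:
        raise ValueError("need at least two students to estimate error")
    mean = statistics.fmean(proportions)
    se = statistics.stdev(proportions) / math.sqrt(n)
    return 100 * mean, 100 * (mean - z * se), 100 * (mean + z * se)


# Hypothetical proportions correct for three students.
mean, lower, upper = percent_correct_with_error_bar([0.4, 0.5, 0.6])
print(f"{mean:.0f}% (error bar {lower:.1f}% to {upper:.1f}%)")
```

Reporting the interval alongside the point estimate gives readers a visible cue that the short form result is an estimate with its own error.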

CONCLUSION 5-2: The method selected for producing a 
short form will likely result in a test that has a different 
reliability (error structure) than the full NAEP, resulting in 
different estimates of the score distribution than the full NAEP. 
As a result, the short form will likely give different numerical 
results than the full NAEP, even if the samples of students were 
identical. 

RECOMMENDATION 5-1: Before attempting to use a short 
form version of NAEP to estimate results on the current NAEP 
scale, the differences in the psychometric characteristics of the 
scores from the short form and current NAEP should be care- 
fully investigated. 

RECOMMENDATION 5-2: Before proceeding with the short 
form, it should be determined whether it is possible to obtain 
estimates of NAEP score distributions from the short form 
that will provide estimates of proportions above achievement 
levels and means for subgroups of the examinee population 
that are of similar accuracy to those from current NAEP. 

RECOMMENDATION 5-3: If the decision is made to pro- 
ceed with the short form, methods should be developed for 
reporting performance on the short form in a way that is mean- 
ingful and not misleading given the differences in quality of 
estimates for current NAEP and the short form. 

Designing Reports of District-Level and 
Market-Basket NAEP Results 



The goal of NAEP is to inform our society about the status of educational 
achievement in the United States and, more recently, in specific states. 
Currently, policy makers are considering if NAEP data gathered from still 
smaller geopolitical units and based on smaller numbers of test items can 
be used to generate meaningful reports for a variety of constituents. These 
proposed reporting practices emanate from desires to improve the useful- 
ness and ease of interpretation of NAEP data. Both proposals call for close 
attention to the format and contents of the new reports. 

When NAEP first proposed producing state-level results, a number of 
concerns were expressed about potential misinterpretation or misuse of the 
data (Stancavage et al., 1992; Hartka & Stancavage, 1994). With the provision 
of below-state NAEP results, the potential for reporting/misinterpretation 
problems is also high. If readers are proud, distressed, or outraged by 
their statewide results, their reactions to district or hometown results are 
likely to be even stronger. In addition, the wider variety of education and 
media professionals providing the public with information about local-level 
test results is also likely to contribute to potential interpretation problems. 
These professionals may have a greater variety of positions to promote as 
well as more varied levels of statistical sophistication. In short, consider- 
ation of effective reporting formats may become more urgent. 

Even if the proposals for district-level and market-basket reporting do 
not come to fruition, attention to the way NAEP information is provided 
would be useful. As described in Chapter 2, the types of NAEP reports are 
many and varied. The information serves many purposes for a broad constellation 
of audiences, including researchers, policy makers, the press, and 
the public. These audiences, both the more technical users and the lay 
public, look to NAEP to support, refute, or inform their ideas about the 
academic accomplishments of students in the United States. The messages 
taken from NAEP’s data displays can easily influence their perceptions 
about the state of education in the United States. 

Generally, both technical users and the lay public tend to extract what- 
ever possible from data displays. Unfortunately, the “whatever possible” 
often translates to “very little” for at least two reasons. First, readers may 
pay very little attention to data reports, feeling that the time required to 
decode often arcane reports is not well spent; the data are not worth the 
additional effort. Second, even when readers carefully study the displays, 
they might misinterpret the data. Even well-intentioned report designs fall 
prey to the cognitive and perceptual misinterpretations of the most serious 
reader (Monmonier, 1991; Cleveland & McGill, 1984; Tversky & Schiano, 
1989). 

Earlier chapters of this report have focused on the feasibility and desir- 
ability of collecting and reporting such data. This chapter focuses on the 
end product — the reports released for public consumption. As part of our 
study, the committee hoped to review prototypes of district-level and 
market-basket reports. NCES provided an example of a district-level report 
that was part of an early draft of technical specifications for below-state 
reporting, and Milwaukee shared with us the report they received as part of 
their participation in a district-level pilot. These reports were presented as 
drafts and examples, not as the definitive formats for district-level reports. 
We reviewed one preliminary mock-up of a market-basket report based on 
simulated data (Johnson, Lazer, & O’Sullivan, 1997). Since ETS is currently 
designing reports as part of the second year of the year 2000 pilot 
project on market-basket reporting, much of the decision making about 
market-basket reports has not yet occurred. Given the stage of the work 
on district-level and market-basket reporting, we present the following dis- 
cussion to assist NAEP’s sponsors with the design of the reports. 

This chapter begins with a review and description of some problems 
cited with regard to the presentations of NAEP data. For this review, we 
relied on the work of a number of researchers, specifically, Hambleton and 
Slater (1995); Wainer (1997); Jaeger (1998); Wainer, Hambleton, & 
Meara (1999); and Hambleton & Meara (2000). The next section provides 
commentary on report samples reviewed during the study. The documents 
reviewed include the following: 

1. Draft Guidelines and Technical Specifications for the Conduct of 
Assessments Below-State Level NAEP Testing NCES, August 1995, 
Draft, which included a mock-up of a report for a district 
(National Center for Education Statistics, 1995). 

2. NAEP 1996 Science Report for Milwaukee Public Schools, Grade 
8, Findings from a special study of the National Assessment of 
Educational Progress (Educational Testing Service, 1997b) 

3. NAEP 1996 Mathematics Report for Milwaukee Public Schools, 
Grade 8, Findings from a special study of the National Assessment 
of Educational Progress (Educational Testing Service, 1997a) 

4. Sample market-basket report based on simulated data (Johnson, 
Lazer, & O’Sullivan, 1997) 

5. NAEP's Year 2000 Market-Basket Study: What Do We Expect to 
Learn? (Mazzeo, 2000) 

The chapter concludes with additional suggestions for enhancing the 
accessibility and comprehensibility of NAEP reports. To assist in the de- 
sign of future reports, we encourage the application of procedures to make 
the data more useable, including user- and needs-assessment, heuristic 
evaluation, and actual usability testing. In the appendix to this report, we 
provide an example of how these techniques might be applied. 

CRITIQUES OF NAEP DATA DISPLAYS 

To date, a number of concerns with the accessibility and comprehensi- 
bility of NAEP reports have been described. The most consistent concerns 
are discussed below. 



High-Level Knowledge of Statistics Is Assumed 

Reports assume an inappropriately high level of statistical knowledge 
for even well-educated lay audiences. There are too many technical terms, 
symbols, and concepts required to understand the message of even rela- 
tively simple data, such as mean test scores as a function of time or location. 
In interviews assessing policy makers’, educational administrators’, and media 
representatives’ understanding of NAEP reports, Hambleton and Slater 
(1995) reported that 42 percent did not understand the meaning of “statistically 
significant.” Even relatively basic mathematical symbols are the 
source of some misunderstanding. For example, roughly one-third of those 
interviewed by Hambleton and Slater did not understand the meaning of 
the “>” and “<” symbols that were used to indicate a reliable increase or 
decrease in mean scores. 



Information Overload and Report Density 

In an attempt to be complete, reports may present too much informa- 
tion, making it difficult for readers to find and extract what they really 
want to know. Wainer (1997a) described this problem in detail with respect 
to NAEP tables, but the same arguments would hold for other formats as 
well. Reports also often contain overly dense displays that readers find 
daunting. This problem deals with readers’ perceptions of ease of access. 
Designers of textbooks and other technical documents have learned that 
reports can be designed to appear more or less difficult to understand just 
by varying simple report features such as the amount and placement of 
“white space” on the page. In addition to ensuring that reports are easy to 
understand, care must be taken to make reports look easy to understand. 



Attempts at Redesign Have Increased “Clutter” 

When displays are redesigned for easy access, design devices are some- 
times used that undermine this objective through increased clutter or per- 
ceptual inaccuracies. That is, designers can go too far in their attempts to 
make data appear more enticing. A case in point is the use of three-dimen- 
sional renderings of data, where line graphs become cliffs, and pie charts 
become floating discs. Three-dimensional renderings are inherently ambiguous 
when the information to be extracted involves relative size judgments 
of parts, such as the relative heights of two bars in a three-dimensional 
bar graph. So, while attempts should be focused on making data 
reports appear more accessible, concurrent design reviews should ensure 
that comprehensibility is not compromised. 



Unnecessary Mental Arithmetic Is Required 

Reports sometimes require readers to perform unnecessary mental 
steps, including unreliable mental arithmetic, to derive information most 
relevant to them. For example, change scores across NAEP administrations 
may be as important to most readers as the absolute mean scores at each 
individual administration. Mistakes in mental arithmetic can easily lead to 
incorrect interpretations, even among readers who understand the meaning 
of the presented data. 
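
A report generator could compute change scores once and print them next to the means, so that readers never have to subtract in their heads. The sketch below assumes mean scores are stored by administration; the function name and the sample values are hypothetical and are not NAEP results.

```python
def change_scores(means_by_administration):
    """Return (earlier, later, change) for each pair of consecutive
    administrations, ordered by administration key."""
    keys = sorted(means_by_administration)
    return [
        (a, b, means_by_administration[b] - means_by_administration[a])
        for a, b in zip(keys, keys[1:])
    ]


# Hypothetical mean scores for three successive administrations.
means = {1: 250, 2: 253, 3: 251}
for earlier, later, change in change_scores(means):
    print(f"Administration {earlier} to {later}: {change:+d} points")
```

Whether a given change is statistically reliable is a separate question that this sketch does not address.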



Graphics Are Infrequently Used 

Reports do not make enough use of graphical alternatives to textual 
and tabular formats. Associated with both the actual and perceived com- 
plexity issues noted above, reports use vast tables of numbers more fre- 
quently than necessary. Some researchers (e.g., Wainer, 1997a; Wainer et 
al., 1999) argue that, in many cases, graphical displays are more appropriate 
than tables. In an experimental study comparing redesigned NAEP data 
displays, many of which were graphs, with traditional NAEP displays con- 
sisting primarily of tables, Wainer demonstrated that the graphical formats 
promote more rapid and accurate interpretations (Wainer et al., 1999). 

CONCLUSION 6-1: Enhancements to the design of NAEP 
reports that allow for communication to a broader audience 
are a way to increase the utility of these tests, independent of 
changes to the methods used to collect and analyze the actual 
data. The data currently available can be made more acces- 
sible, comprehensible, and relevant. 

REVIEW OF SAMPLE DISTRICT-LEVEL AND 
MARKET-BASKET REPORTS 

District-Level Reports 

NCES’ Specifications for Below-state Reporting (National Center for 
Education Statistics, 1995), still considered a draft document, included a 
report summarizing results for one of the “naturally occurring” districts. 
This report was in tabular format and included means, standard deviations, 
quartiles, and percents at or above each achievement level. Data were re- 
ported for test takers grouped by gender, ethnicity, parents’ educational 
level, type of location, Title I participation, and eligibility status in the 
school lunch program. Very basic (and somewhat cryptic) interpretive 
information described the grouping categories and the statistics reported. 

The reports prepared for the Milwaukee Public School system consisted 
entirely of tables accompanied by detailed explanatory text. To 
enable comparisons, the tables included results for Milwaukee, Wisconsin, 
and the United States. The report contained numerous two-way tables that 
presented mean scaled scores for test takers grouped by demographic (e.g., 
gender, ethnicity, parental education), school environment (e.g., parental 
support, absenteeism, availability of classroom resources), and classroom 
characteristics (e.g., amount of homework assigned, availability of comput- 
ers). Appendices provided guidance on grouping categories and on the re- 
ported statistics. 



Critique of District-Level Reports 

To begin our review, we compared the sample district-level reports, 
particularly those prepared for Milwaukee, with some of the standard 
NAEP reports. Although the district-level efforts attempted to make the 
reports more readable, while limiting misinterpretations, there is still sub- 
stantial room for improvement. 

The most salient deficiency in both reports is the proliferation of tables. 
Much of the data could be relayed succinctly in graphical form, yet no 
graphs were used. If we were allowed to make only one suggestion about NAEP 
reporting, it would be to use graphical rather than tabular formats when- 
ever feasible, even when displaying relatively few data values (Carswell & 
Ramzy, 1997). 

The use of graphical formats will help address many of the other prob- 
lems associated with previous NAEP reports, including information over- 
load and readers’ perceptions that the reports are difficult to read. One of 
the important ways that graphs can reduce overload is by showing relations 
among display elements, called “emergent features,” to allow the reader to 
draw conclusions without having to hold and manipulate numerical information 
in their working memory (Bennett & Flach, 1992). For example, a 
graph with three lines could be used to portray the trends in the relation- 
ships between NAEP scale scores and the amount of daily homework 
students complete for the United States, Wisconsin, and Milwaukee. One 
line would show the relationship of homework and NAEP scores for the 
city, another line for the state, and a third for the nation. The direction of 
the slopes of the lines, and the relationships among the lines (for example, 
fanning out vs. parallel) can be recognized very rapidly. These emergent 
features can be used to evaluate relationships among the data for different 
groups. For example, the relationships between amount of homework assigned 
and NAEP performance can readily be compared for Milwaukee 
versus Wisconsin students and versus the nation. 

The amount of information presented in individual data displays is a 
concern for the samples in the below-state technical specifications. The 
tables reporting achievement-level percentages include seven columns 
which, based on current knowledge about working memory constraints, is 
probably about three columns too many. It will be difficult for people to 
read the table and keep track of which column they are reading while mov- 
ing down the page, at least without resorting to annoying and error-prone 
visual scanning to reread the column headings. 
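
A check of this kind is easy to automate during report design. The following sketch flags tables whose column count exceeds a working limit of four, which is our reading of the "about three columns too many" remark for a seven-column table; the limit, the function name, and the headings are assumptions made for illustration.

```python
# Rule-of-thumb limit on columns a reader must track at once (an assumption
# inferred from the seven-column example in the text).
COLUMN_LIMIT = 4


def columns_over_limit(headings, limit=COLUMN_LIMIT):
    """Return how many columns a table has beyond the limit (0 if within it)."""
    return max(0, len(headings) - limit)


# Hypothetical headings for a seven-column achievement-level table.
headings = ["Group", "N", "Mean", "Below Basic", "Basic", "Proficient", "Advanced"]
print(columns_over_limit(headings))  # 3
```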

Although the Milwaukee report limited most of its tables to between 
three and five columns, the actual range was from two to seven. While this 
streamlining aids the readability of individual tables, it adds to the size of 
the overall report and may make it difficult for some readers to find specific 
information spread over multiple tables and pages. This potential problem 
points to the importance of ascertaining users’ information needs and pri- 
orities during the early stages of report design. For example, if the home- 
work and test score relationships are of greater interest than the relationship 
between calculator use and test scores, then the homework table should be 
given priority of position in the report. Determination of the information 
to be combined in a single display should be based on the types of ques- 
tions readers tend to ask of the data. Again, it should be noted that the use 
of graphs rather than tables may allow more variables to be combined in a 
single display without overloading the reader. 

Finally, the language of the reports we reviewed still overestimates the 
statistical expertise of its audience. For example, in the below-state report 
specifications, column headings included “n,” “cv,” and “< basic.” Recall 
that Hambleton and Slater (1995) found that only about one-third of their 
subjects understood the use of “<” and “>” symbols. The “cv” is likely to be 
beyond the grasp of most readers, and the “n,” though possibly familiar to 
undergraduates enrolled in a statistics course, is probably a vague memory, 
at best, for most people. The Milwaukee reports avoided many of these 
problems by reporting mainly percentages and average-scale scores. How- 
ever, they did report scale scores by selected percentiles (percent at each 
quartile), which may not be widely understood. 

The Milwaukee reports also provided brief textual interpretations di- 
rectly above each table. Some interpretations were provided to ensure that 
readers did not focus too heavily on small, statistically unreliable differences; 
other interpretations were simply overviews of table content. In 
general, these brief text inserts are likely to be useful to people searching for 
specific kinds of information or who may be unfamiliar with inferential 
statistics and associated notations. However, the writers of these inserts 
must take care in selecting their terminology and in avoiding the special- 
ized statistical usage of terms such as “significant” in describing results. 



Market-Basket Reports 

Work on designing market-basket reports is still in its earliest stage. As 
part of market-basket preliminary research, Johnson and colleagues (1997) 
provided a sample report based on simulated data. Reactions to this report 
were obtained during the committee’s workshop on market-basket reporting. 
The mock-up appears below. 

Table 6-1 displays percent correct results for test takers in fourth, eighth 
and twelfth grades. Column 2 presents the overall average percent correct 
for test takers in each grade. Column 3 shows the percent correct scores for 
each achievement-level category associated with the minimum score 
cutpoint for the category. For example, the cutpoint for the fourth-grade 
advanced category would be associated with a score of 80 percent correct. 
A score of 33 percent correct would represent performance at the cutpoint 
for the twelfth-grade basic category. 



TABLE 6-1 Example of Market-Basket Results*

(1)      (2)                 (3)
Grade    Average Percent     Cut Points by Achievement Level
         Correct Score**     Advanced    Proficient    Basic

4        41%                 80%         58%           34%
8        42%                 73%         55%           37%
12       40%                 75%         57%           33%

*Data in Table 6-1 are based on simulations from the full NAEP 
assessment; results for a market basket might differ depending on 
its composition. 

**In terms of total possible points 






Comments on this report were mixed, especially given that it was pre- 
sented as a mock-up and not as a prototype for market-basket reporting. 
The primary concerns related to substantive issues, specifically the percent 
correct scores that would be associated with the achievement level descrip- 
tors (e.g., 55 percent correct would represent a proficient level). Given this 
concern, it would be essential to provide explanatory text documenting the 
meaning of the various achievement level descriptors. 
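
To make the link between percent correct scores and achievement levels concrete, the sketch below maps a score to a level using the cut points shown in Table 6-1. Those cut points come from simulated data, so the mapping is illustrative only; the function name and the "Below Basic" label for scores under the basic cut point are our assumptions.

```python
# Cut points (percent correct) copied from Table 6-1, which is based on
# simulated data; an operational market basket could differ.
CUT_POINTS = {
    4: {"Advanced": 80, "Proficient": 58, "Basic": 34},
    8: {"Advanced": 73, "Proficient": 55, "Basic": 37},
    12: {"Advanced": 75, "Proficient": 57, "Basic": 33},
}


def achievement_level(grade, percent_correct):
    """Return the highest level whose cut point the score meets or exceeds."""
    cuts = CUT_POINTS[grade]
    for level in ("Advanced", "Proficient", "Basic"):
        if percent_correct >= cuts[level]:
            return level
    return "Below Basic"  # label assumed for scores under the basic cut point


print(achievement_level(8, 55))   # Proficient
print(achievement_level(4, 41))   # Basic
```

The second call shows that the fourth-grade average of 41 percent correct in Table 6-1 would fall in the basic category under these cut points, which is the kind of result that explanatory text would need to address.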

Further design of market-basket reports is an ongoing part of ETS’s 
pilot study. The year 2000 study is expected to yield two types of reports: 
(1) a research report intended for technical audiences that examines test 
development and data analytic issues associated with the implementation 
of market-basket reporting, and (2) a report intended for general audi- 
ences. According to Mazzeo (2000), some of the features being explored 
include 



• National and state-level NAEP results (average scores and achievement 
level percentages) expressed in a market-basket metric (e.g., 
percent correct). Such results could be confined to “total-group” 
scores or could be extended to include national and state results by 
gender, race/ethnicity, parental education, and other standard 
NAEP reporting groups. 

• All, or a sample, of the items that make up the short form as well 
as performance data. The text of the items, scoring rubrics, and 
sample student responses might also be provided. 

• A format and writing style appropriate for a general public 
audience. 

• Electronic reporting. 

Pilot study plans call for focus groups to be conducted during the second 
year to obtain feedback on the report designs. Because report design is in 
the early development stage and actual prototypic reports are unavailable, 
we next discuss methods for designing reports to assist NAEP’s sponsors 
with this process. 

TOWARD COMPREHENSIBLE AND ACCESSIBLE 
DISTRICT-LEVEL AND MARKET-BASKET REPORTS: 
THE ARGUMENT FOR FORMAL USABILITY AUDITS 

Current Practice 

NCES and NAGB have recognized the need for more attention to the 
public “face” of NAEP reports, funding research on readers’ responses to 
and understanding of current reports (Jaeger, 1998; Hambleton & Meara, 
2000). However, the design reviews and modifications necessary to address 
the comprehensibility and accessibility issues raised by this research remain 
fairly informal and unsystematic. 

NAGB has encouraged NCES to redirect NAEP reports to the general 
public and away from more technical audiences (Bourque, personal com- 
munication, April 2000). For example, in 1992, NAGB adopted resolu- 
tions calling for achievement levels as the primary way of reporting NAEP 
data, believing that achievement levels are more understandable to the pub- 
lic than the traditional scale scores. In addition, a separate NAGB resolu- 
tion resulted in the relocation of standard errors — of most interest to the 
technical community and less so to the public — to the appendices of 
reports. However, such changes appear to be based on the opinions of 
board members through NAGB’s Dissemination and Reporting Commit- 
tee, rather than on results from formal usability audits or tests. Although 
NAEP reports go through NCES departmental reviews and adjudication, 
it is not current practice to require that a usability expert be a part of the 
review process. 



Suggested Practice 

One way to bring the concerns of accessibility and comprehensibility 
into the design and review process for NAEP reports is through the appli- 
cation of a number of “usability engineering” methods. These methods, 
which have been applied extensively to consumer product and electronic 
information design, rely on user-centered feedback and user participation 
in all phases of development (e.g., Nielsen, 1993; Norman, 1988; Rubin, 
1994). Box 6-1 illustrates user-centered design strategies that might be 
applied to the development and revision of NAEP reports. 

After defining the “mission” of the report by incorporating directives, 
constraints (e.g., costs, time lines), and program requirements, an in-depth 

BOX 6-1 

Example of design heuristics for evaluating 
the usability of data displays 

(1) Is the format compatible with the performance criterion 
selected? If speed of finding and reporting information is more im- 
portant than absolute accuracy, then graphical or more holistic dis- 
plays should generally be used. If accuracy of retrieval of precise 
values is the goal, a tabular display may be required. 

(2) Is the structure of the display compatible with the structure 
of the data? If the data structure has been described prior to 
choosing a display, then the data structure should determine the 
format. For example, periodic or cyclic time trends should be pre- 
sented on a polar plot and linear trends should be presented in the 
form of a line graph. 

(3) Is the perceptual grouping of information compatible with 
the mental grouping users must perform to extract the infor- 
mation they want and need? Given data from the user needs 
assessment, are the data values necessary for the most important 
comparison or integration grouped most strongly (i.e., associated 
by the greatest number of gestalt grouping principles such as spa- 
tial proximity, similarity, connectedness, and enclosure)? Are infor- 
mation values that are rarely combined isolated from one another? 

(4) Is the level of numeric detail compatible with the reliability 
of the data and the needs of the reader? Reporting of decimal 
places should be reduced to the minimum necessary for the task at 
hand, as unnecessary precision results in increased reading time 
and reduced discriminability among numbers (and increased po- 
tential for error). 

(5) Is data salience compatible with data importance? One of 
the purposes of some data displays is to direct the reader's attention. 
Because involuntary shifts of attention are induced by dissimilarity 
(e.g., a red pie chart in a table filled with blue numbers), make 
certain that the most dissimilar or incongruent features of the visual 
array represent information of genuine importance (based on the 
results of data analysis or on the interests of the users). 

(6) Is the data display compatible with working memory limits? 
Working memory refers to two fundamental phenomena that all 
humans experience. The first phenomenon is that people retain 
their immediate thoughts only until other thoughts displace them. 
New thoughts displace old thoughts because working memory can 
only hold so much information at a given time. In general, indi- 
vidual displays should include no more than four organizational 
“objects” that must be used in conjunction (e.g., lines in a graph, 
columns in a table, or footnote identifier in either type of display). 
In addition, information to be used in conjunction should be placed 
together, so that one piece of information does not have to be held 
in working memory while the reader is looking for the information 
with which to integrate it. 

(7) Are physical properties of the stimuli compatible with our 
ability to detect, discriminate, and recognize these properties? 
Does the physical difference in the height of two bars or the slope of 
two lines exceed the minimum necessary to result in a perceptual 
just-noticeable-difference (JND)? Are data values that need to be 
compared presented, where possible, as points along common 
scales? If points along common scales cannot be used, then are 
physical dimensions chosen from as near the front of the following 
as possible — lengths, angles and slopes, volumes, lightness/dark- 
ness, and hue? If users must precisely identify a visual element 
from among a small set of alternatives (e.g., the color of a line that 
represents the data collected from the far western states rather 
than the Northeast, Midwest, or South), then different dimensions 
should be combined redundantly to aid identification and to maxi- 
mize dissimilarity. 

(8) Is the organization of information in the display compatible 
with spatial metaphors and population stereotypes? Are better 
scores represented as “higher” scores (e.g., by graphing number 
correct rather than number of errors)? Are more recent scores 
reported to the right of earlier scores? Are lines or bars represent- 
ing more southern geographic regions represented by “warmer” 
colors? 

(9) Is the choice of display format and ornamentation compatible 
with the users’ preferences and biases? Three-dimensional 
displays should be avoided when showing controversial results, 
since readers find two-dimensional displays more “trustworthy.” 
Use bar graphs instead of line graphs when readers are likely to be 
intimidated by statistical displays. 

study is needed to identify the target audience and their likely information 
needs. This is the stage of user-needs analysis, an aspect of NAEP design 
both in terms of test construction and reporting that seems to be somewhat 
neglected. As we have emphasized elsewhere in this report, we need to 
know exactly who is interested in district-level and market-basket NAEP 
data, as well as who is interested in current NAEP data. It will also be 
necessary to determine users’ expectations of what information can be 
gleaned from the reports; gauge their level of statistical sophistication and 
experience with educational test data; and elicit information about their 
experiences, from which guiding metaphors might be derived to aid in 
translating test data into more understandable concepts. 

This information can then be translated into a series of user require- 
ments. For example, these requirements should include a list of statistical 
terms or concepts that the users can be expected to know and a list of 
terms and concepts likely to be misunderstood. Likewise, the requirements 
could indicate the minimum reading level of likely users. After 
gaining information about the users’ interests and expectations, a list of 
“most important questions” can also be generated to inform the selection 
and ordering of specific data displays in the reports. Knowledge about the 
users’ educational and work histories might provide suggestions for appro- 
priate data metaphors, for example, use of sports statistics rather than eco- 
nomic indices. 

With the user requirements identified, report designers can create 
mock-ups of entire reports and component displays. These mock-ups can 
use past data or “dummy” data to increase their realism. The mock-ups 
should then undergo heuristic evaluations in which a usability specialist 
checks the designs against a list of empirically established guidelines for 
reducing effort, time, and errors in the reading of data displays. Box 6-1 
provides one example of a set of such heuristics. However, there are additional 
guidelines available, such as those described by Jaeger (1998); Pickle 
and Herrmann (1994) for statistical maps; Wainer (1997a) for tables; and 
Spence & Lewandowsky (1989), Kosslyn (1994), Cleveland (1985), 
and Gillan, Wickens, Hollands, & Carswell (1998) for graphs. 

It is important when choosing and using heuristics for early and rapid 
usability reviews that care be taken to select scientifically validated heuris- 
tics (Herrmann & Pickle, 1996; Kosslyn, 1985; Simkin & Hastie, 1987; 
Tversky & Schiano, 1989) that are not simply the result of design lore or 
convention. That is, care should be taken to ensure that the science of 
human cognition and comprehension informs the art of NAEP reporting.

Suggestions made during the heuristic evaluation can be used to modify
the overall report layout or the design of specific displays. At this point, 
actual usability testing becomes essential. Wainer (1997a) provides an 
excellent example of this step in the review process. In his study, a sample 
of potential users answered questions about NAEP data while viewing origi- 
nal and revised data displays. The user-subjects were also timed and probed 
for their preferences. In the Wainer study, most of the revised displays led 
to better performance and were preferred. However, there were some 
exceptions, which should lead to additional design revision or to the recon- 
sideration of the original design for the final report. 

Once the reports are produced and distributed, further usability analy- 
ses can be made on the actual use of the reports (e.g., citations, requests for 
copies) and on misuses made of the data (overgeneralizations, errors in 
interpretation). This information can be integrated into the next user- 
needs analysis before the next round of NAEP data is published. 

Previous critiques of NAEP report design (Jaeger, 1998) have suggested 
a number of these components in isolation, such as market research to 
determine user expectations and field testing to review actual usability. 
Focus groups, like those conducted by Hartka and Stancavage (1994) dur- 
ing evaluations of the Trial State Assessment, provide examples. We suggest 
that these processes should be applied to the development of the reports 
issued to NAEP’s audience in connection with district-level reporting and
the design of market-basket reports. In the appendix to this report, we
provide an example of how a usability process might work. 



Drawing on Appropriate Imagery 

The issue of defining appropriate metaphors to enhance report com- 
prehension is particularly important when considering market-basket style 
reports. The model that has been used for market-basket reporting is the 
CPI (Forsyth et al., 1996). For communicating information about fluctua- 
tions in the price of consumer goods, the image of an actual market basket 
is both appropriate and very familiar to consumers. However, a market 
basket is an odd, even jarring image in the context of educational achieve- 
ment. Most people probably do not view education as a consumer pur- 
chase, nor are they likely to perceive it as an assortment of independent 
parcels placed in a shopping cart. The question, however, is what meta- 
phor should replace the market basket in representing a composite report- 
ing statistic of NAEP performance? Again, the user-needs analysis is the 
appropriate forum for determining the most direct or evocative metaphor,
be it a “report card,” a “GPA,” or some sort of educational “batting average.” 

CONCLUSIONS AND RECOMMENDATIONS 

Given the amount of attention that below-state results would be likely 
to receive, significant time and effort should be devoted to product design. 
The design of data displays should be carefully reviewed and should evolve 
through methodical processes to consider the purposes the data might serve, 
the needs of users, the types of interpretations, and anticipated types of 
misinterpretations. Any imagery used to describe reports should be based 
on metaphors that evoke appropriate images for educational data. User- 
needs analysis is the appropriate forum for determining both product de- 
sign and effective metaphors for aiding in communication. 

RECOMMENDATION 6-1: Appropriate user profiles and 
needs assessments should be considered as part of the inte- 
grated design of district-level and market-basket reports. The 
integration of usability as part of the overall design process is 
essential because it considers the information needs of the 
public. 

RECOMMENDATION 6-2: The text, graphs, and tables of 
reports developed for market-basket or district-level reporting 
should be subjected to standard usability engineering tech- 
niques including appropriate usability testing methodologies. 

The purpose of such procedures would be to make reports 
more comprehensible to their readers and more accessible to 
their target audiences. 
7

Implications of District-Level and 
Market-Basket Reporting 



The two reporting practices that are the subject of this study represent 
more than extensions of current NAEP programs and procedures: they are 
essentially new programs that would result in new NAEP products. Both 
reporting methods would present new information that would draw atten- 
tion from new audiences — audiences that, in the past, may have paid little 
attention to NAEP results. Implementation of either reporting method 
would pose challenges for NAEP’s existing procedures. District-level reporting
would affect sampling procedures. Creation of a short form of
NAEP has implications for test construction procedures. Both market- 
basket and district-level reporting would alter analytic and scoring method- 
ologies as well as the number and types of reports to be prepared. Given 
these factors, implementation of either reporting practice can be expected 
to have a significant impact on the internal configuration of the NAEP 
program. Furthermore, the use of data resulting from these reporting meth- 
ods by policy makers, state and local departments of education, the press, 
and the lay public could carry consequences for state and local assessment, 
curriculum, and instruction. 

In this chapter, we address questions about the consequences that the 
two reporting practices might have, specifically: (1) Would either district-level
or market-basket reporting pose any threats to the validity of inferences
from national and state NAEP? and (2) What are the implications of
district-level and market-basket reporting for other state and local assess- 
ment programs? In the first section of this chapter, we explore the likely 
implications of district-level and market-basket reporting on the NAEP
program. In the second section, we discuss the impact of the reporting 
practices on state and local educational systems. 

IMPLICATIONS FOR THE NAEP PROGRAM 

NAEP is composed of many interrelated components that work
together to form a complex system. A change to any given piece of this 
system may have consequences for other pieces of the system. Implementa- 
tion of either district-level or market-basket reporting would require 
numerous changes. 

First, the type and nature of reported data will influence NAEP’s
sampling and analytic methodologies. Different sampling procedures 
would be needed to allow reporting of district-level data. Different analytic 
procedures would be needed to condition on district characteristics rather 
than state characteristics. 

Second, the types and numbers of reports required will affect the com- 
plexity and length of time for production. Under district-level reporting, 
the number of reports produced could increase significantly. Preparation 
of market-basket results based on synthetic forms would introduce signifi- 
cant complexity. 

Third, the uses made of reported data will affect the relative impor- 
tance of the assessment in schools and the ways schools and students pre- 
pare for the assessment. Such changes suggest the need for additional user 
support and interpretive guidance. Policy would need to be formulated to 
guide preparation activities. 

Hence, changes cannot be enacted capriciously but must be consid- 
ered in relation to their potential effects on other pieces of the system. In 
the text below, we expand on this by exploring some of the effects the 
proposed reporting practices might have on the validity of inferences drawn 
from NAEP results as well as on NAEP’s procedures, policies, and program
costs. 



Increasing the Stakes 

Traditionally, NAEP has been a low-stakes assessment, since decisions 
about schools, teachers, and individuals have not been based on test results. 
The move to reporting data for school districts — either via current NAEP 
or through the short form — brings the level of reporting much closer to 
those responsible for instruction. As the level of reporting moves to these
smaller units, the assessment stakes will likely become even higher for 
schools and teachers. Increasing the stakes can have a myriad of effects. 

First and foremost, increasing the stakes would require immediate attention
to security issues. If high-stakes consequences were attached to district-level
performance on current NAEP or based on the short form, the
likelihood of security breaches would increase. Security breaches could com- 
promise NAEP items as well as the items that make up the short form. In 
anticipation of increased potential for security breaches, item development 
would need to be stepped up. Furthermore, with higher stakes, test prepa- 
ration activities would become more of a concern, since inappropriate test 
preparation practices could unfairly advantage some districts and could af- 
fect the validity and integrity of test results. As suggested by Roeber (1994), 
NAEP’s sponsors would need to lay out appropriate and inappropriate test 
preparation procedures. 

Higher stakes also increase motivation to perform well. Currently,
students have little incentive to do well on NAEP beyond their own per- 
sonal pride and exhortations to honor the state. But if districts were able to 
obtain results (either as part of current NAEP or via the short form), schools 
and students might demonstrate greater motivation to perform well on the 
assessment. Previous research examining the effects of motivation on NAEP 
performance suggested that changes in motivation may be associated with 
increased performance (Linn, Koretz, & Baker, 1996). For example, 
Kiplinger and Linn (1992; 1995/1996) studied changes in performance on 
NAEP items when a block of NAEP mathematics items was embedded in a 
state assessment used for state and local school accountability purposes; 
presumably, schools and students are more motivated to perform well on a 
test used for accountability purposes. Their studies found a small, but 
statistically significant, effect, suggesting that students performed better on 
the NAEP items administered as part of the state assessment than on the 
same items administered as part of NAEP. 

If motivation to do well can affect students’ performance, then a num- 
ber of issues may arise. Performance on NAEP may increase — perhaps not 
as a result of increased skill levels but as a result of increased motivation to 
demonstrate skill levels. This can degrade the integrity of NAEP as a moni- 
tor of educational progress. For example, under district-level reporting for 
current NAEP, performance gains could be seen in districts that receive 
results, thereby improving performance for the state. States that have no 
districts qualifying to receive results may not realize similar gains. It would 
be impossible to discern whether performance increases represent real skill-level
changes or are only an artifact of changes in motivation.

If plans for the short form were implemented, changes in motivation 
could further affect the comparability of short-form results to regular NAEP 
results. Depending on the ways schools and districts decide to use short-form
results, motivation to do well may increase. These changes in motivation
would undermine hopes that short-form results could be compared with
main NAEP results.



Interpreting Reported Data 

Although these reporting approaches have been suggested as ways of 
making NAEP reports simpler and more interpretable, they may add com- 
plexities that require additional clarification. Below-state reporting may 
attract new audiences, unfamiliar with the goals, purposes, and limitations 
of NAEP. Such audiences would require assistance in understanding the 
meanings and implications of NAEP results. NAEP’s sponsors could find 
themselves faced with providing support materials to new and different 
users to ensure appropriate interpretations of results. 

Use of a percent correct metric for market-basket reporting would 
require considerable support to prevent misinterpretation, even for experienced
users of NAEP results. For instance, during the committee’s workshop
on market-basket reporting, several speakers cautioned that the percent
correct scale proposed for use with the market basket (see Table 6-1)
differs from the way the public generally views percent correct scores. A 
number of speakers commented that people typically regard 70 percent as a 
passing score; scores around 80 percent as indicating proficiency; and scores 
of 90 percent and above as advanced. What would members of the general 
public think when they saw that the average American student scored less 
than 50 percent on the test? Or, that the proficient student only answered 
55 percent of the questions correctly? According to one assessment direc- 
tor, “Most test directors [know enough about NAEP to] understand why 
this might be, but no teacher, parent, or member of the public would con- 
sider 55 percent proficient. They would consider that score as representing 
‘clueless,’ perhaps, and would think even less of the test and the educators 
that would purport to pass off 55 percent as proficient” (National Research 
Council, 2000). NAEP’s sponsors may find that explaining percent correct
scores would require substantial interpretive support to their various audiences.



Demand for School and Individual Results

Availability of the short form would fuel the demand for official indi- 
vidual scores as it is likely that the released short forms will be posted on 
websites, and audiences will be encouraged to “take the test” and get a score 
(Colvin, 2000). Because the short form could be administered to all chil- 
dren in a specific grade in a manner closely resembling other testing in 
schools — testing that results in individual score reports — maintaining the 
prohibition against individual results will be difficult. 

District-level reporting may increase the expectation for school and 
student level results as well. Instructionally useful information about con- 
tent areas within a subject — for example, geometry and algebra scores, 
rather than simply overall mathematics scores — is typically available to dis- 
tricts as part of other testing programs and may also become an expectation 
for NAEP. 



Participation in State or Main NAEP 

Participation in NAEP may be affected both positively and negatively 
by the proposed new reporting practices. Assuming resolution of the many 
technical and logistical issues related to district-level reporting and that few 
negative consequences are associated with performance, participation in 
state or main NAEP may increase. Districts may be willing to invest stu- 
dent and teacher time in return for data they consider useful. 

For market-basket reporting via the short form, the impact may be the 
opposite. If districts are able to receive information more quickly with less 
testing time, they may opt for the use of the short form in place of partici- 
pating in state or main NAEP. 



Increased Program Costs 

Moving to either of the proposed reporting methods would have sig- 
nificant cost implications. Increased item development would be needed — 
due to the security considerations associated with district-level reporting, 
the number of items released as part of the market basket, and the items 
needed to construct short forms. Larger numbers of students would be 
tested to accommodate reporting district-level results, which could sub- 
stantially increase test administration costs. Scoring procedures for both 
reporting practices could also introduce additional complexities, which 
would increase costs associated with data analyses. Increased numbers of
reports would be required, since separate reports would be prepared for 
each participating district and to provide market-basket results. NAEP’s 
sponsors would need to provide interpretive support to assist users of the 
new products. Thorough evaluation of the costs associated with the report- 
ing methods is essential. And, if these costs are to be passed on to users 
(either the state or the district), they need to be known and specified prior 
to considering districts’ and states’ interest in either program.

IMPLICATIONS FOR STATE AND LOCAL 
EDUCATIONAL SYSTEMS 

States’ and districts’ educational systems vary widely, making it impossible
to characterize in a simple way the role of assessment or the relationships
among assessment, curriculum, and instruction. Traditionally, however,
assessment either serves an accountability function or acts as an integral
component of the larger instructional system. Since assessment is one
aspect of a system with interrelated parts, changes in assessment systems 
affect curriculum and instruction, as well as what we know about student 
learning. Likewise, changes in curriculum or instruction affect assessment. 

Instructional systems are often initially developed from expectations 
for student learning. These expectations are structured by curricula that 
map essential steps in the development of that learning. Schools imple- 
ment instructional strategies that enable students to reach the identified 
curricular milestones and expectations. Assessment occurs at appropriate 
points in the instructional process to inform decision makers about the 
status of student learning and to provide information for further instruc- 
tional planning. 

In the ideal, each of these components integrally connects to the other 
components of the instructional system. However, there are a myriad of 
factors and influences that can negatively affect the symbiotic relationships 
among the components. Any resulting disconnect between the compo- 
nents can derail student learning, the reporting of learning progress, or the 
instructional planning essential to continued learning. To avoid these dis- 
ruptions, recent educational reforms have focused on the alignment of 
expectations (often called standards), curriculum, instruction, and assess- 
ment. 

This idealized system is subject to influences by public policy, public 
relations, community pressure, and other forces outside of the learning 
system. These additional forces can produce disconnects among components
of the system and can result in inefficiencies that can hamper students’
opportunities to attain the desired expectations. Thus, it is important
to consider the possible effects of district-level and market-basket
reporting on state and local curricula and assessment systems. As with any 
change, the potential implications for local systems of implementing dis- 
trict-level NAEP or market-basket reporting are many and varied. 

For local educational systems, the implications of district-level report- 
ing and market-basket or short-form reporting may parallel those antici- 
pated with the implementation of the state NAEP (see discussion in Chap- 
ter 3), as well as include implications specific to district-level instructional 
systems. The text below discusses the likely effects of the two proposed 
reporting practices on local curricula and assessments. 



Assessment Areas, Content, Schedules, and Methodology 

Currently, many state assessments are administered at about the same 
time of year as national and state NAEP. Schedule conflicts have put many
districts in the position of having to choose between NAEP and state or 
local assessments. When faced with such conflicts, districts have tended to 
withdraw from NAEP participation in order to accommodate the schedule 
for mandated state and local assessments. But if NAEP results were re- 
ported at the district level, there is likely to be more focus on those results. 
This could cause districts or states to favor NAEP participation over their 
local assessment programs. Attempts to ensure that students are not
overtested or weary at the time of the NAEP testing could lead to changes in
current assessment schedules as well as modification of current assessment 
systems. 

Data from a high visibility national assessment may receive more at- 
tention than local assessment results. Generally, local curricula and expec- 
tations are closely tied with local and state assessment — but not necessarily 
with NAEP. Comparisons of performance on the two sets of results may 
portray different pictures about students’ accomplishments, differences that 
may be primarily attributable to alignment between local assessment and 
instructional programs. As a result, there might be a push to align instruc- 
tion more closely with what NAEP tests. Or, current assessment systems 
might be replaced by the short form, given the desires and pressures for 
comparisons with national benchmarks. Such changes can potentially disrupt
the instructional and learning systems currently in place.

Based on information gathered during the committee-sponsored workshops,
it might be expected that local assessments would be influenced by
the kinds of items and the format of items that are used on NAEP. Workshop
participants commented that states have found the release of NAEP
items to be useful in guiding item development for state assessments. For 
example, the use of performance assessments and constructed response 
questions in NAEP has led to the inclusion of similarly formatted ques- 
tions in state instruments. Since the research involved in developing NAEP 
items is often much more extensive than is possible within state research 
divisions, states feel quite comfortable using the NAEP design as a model 
in developing their tests. If district-level reporting were implemented, these 
changes would also be likely for local assessments. The influence of NAEP’s
formats on local assessments may be more pronounced given the number 
of items released in connection with the market basket. This could benefit 
the local systems, but only to the degree that the content to be assessed, the 
testing purposes, and other important characteristics of the test design 
would dictate the use of such item types. A significant disconnect within 
the local system of curriculum, instruction, and assessment could be cre- 
ated if there is insufficient alignment between NAEP and local instruc- 
tional programs. 



Approaches to Reporting Results 

District-level NAEP reports might also have an effect on the type of 
information districts report about their own assessments. To reduce confu- 
sion for the public, districts might choose a single form of reporting. Most 
likely, approaches used for the higher visibility (perceived as the “higher 
priority”) assessment would prevail. Thus, districts may adopt the use of 
NAEP-like achievement levels, scaled scores that appeared consistent with 
NAEP results, as well as certain statistical and other processes. 

This pattern has been seen in statewide assessments. During the 
committee’s workshops, representatives from state assessment offices commented
that NAEP’s use of achievement levels to summarize performance
has been highly influential.1 Many states have moved to achievement-level
reporting, and some use the same achievement-level descriptors as NAEP.
This emulation of NAEP may increase confusion. For example, some
misinterpretation has been associated with the achievement levels. One workshop
participant noted that results from a recent NAEP administration
revealed that 60 percent of their students performed below the proficient
level in reading. State legislators interpreted this finding to mean that their
students lacked essential reading skills (an interpretation not necessarily
justified by the NAEP results) and advocated for revisions in the state reading
instruction and assessment program. Under the amended system,
students take an oral reading test in second grade, which allows for early
identification and remediation of reading problems. Low-performing students
then receive an individualized reading program designed to improve
their reading mastery (National Research Council, 1999c). While the ultimate
result may have benefited low-performing students, the original
interpretation of NAEP results may not have been appropriate.

1 It should also be pointed out that the NAEP achievement levels have been the subject
of considerable research and debate. Details can be found in National Research Council
(1999b) and Hambleton, Brennan, Brown, Dodd, Forsyth, Mehrens, Nelhaus, Reckase,
Rindone, van der Linden, & Zwick (2000).

There are marked disadvantages associated with percent correct report- 
ing. Percent correct scores may appear simple to understand, but they are 
subject to misinterpretation (see Chapter 4). If NAEP moved to reporting
percent correct scores on market-basket sets of items, states and districts 
might be expected to consider following suit. Attempting to share the cred- 
ibility of NAEP through applying such reporting approaches to local as- 
sessments would undermine the effectiveness and the appropriateness of 
current approaches to the reporting of results for many local assessments. 

These and other approaches used by NAEP might initially appear 
appropriate for local assessment systems. However, attempts to emulate the 
national assessment in these areas are fraught with obstacles. NAEP’s matrix
sampling approach, for example, is not appropriate for producing indi- 
vidual student results. The sophistication and complexity of the processes 
that underlie NAEP development, scoring, and reporting would likely be 
inappropriate or unachievable for many local assessments due to various 
factors. These factors include sample size, expertise, and resources at district 
levels, as well as fundamental issues related to the comparability of score 
scales, comparability of achievement levels determined with differing groups 
on differing content using differing procedures, and other technical issues. 

Impact on Curriculum 

The use of district-level and market-basket reports may also have an 
impact on the curricular content taught in schools. With highly visible 
NAEP results being reported at the district level, either via full NAEP or
through the short form, there would be some pressure for curriculum to 
become more aligned with those assessment results. By definition, the 
market-basket concept implies a domain being assessed and reported that is 
narrower than the entire NAEP framework, and the short form would be 
an even smaller sample of that domain. The impact on curriculum of re- 
porting at the district level is likely to be significant, due to this narrowed 
focus. The limited set of items would likely reduce the scope of curricular 
expectations, especially in the context of strong public scrutiny. 

Moreover, the market basket might supplant local standards due to 
its perceived priority. Because the market basket is smaller, it may appear
to some to represent a carefully reasoned set of priorities for learning. And 
because it was developed nationally, the market basket might appear to 
represent a more general consensus about what students should know and 
be able to do than a locally generated set of content and standards. 



Linking Local Results to NAEP 

There might also be attempts to link local-level assessment results to
NAEP’s district-level results, again for purposes of reducing confusion in
interpreting results or for “improving” the comparability between results 
from differing assessments. Workshop participants observed that an appeal- 
ing feature of district-level reporting for NAEP would be the presumed 
ability to compare district assessment results with stable external measures 
of achievement. There are several problems with attempts to link to NAEP. 
Earlier reports published by the National Research Council have indicated 
the problematic nature of attempting or touting such connections (National 
Research Council, 1999a; National Research Council, 1999d). 

CONCLUSIONS AND RECOMMENDATIONS 

Many of the concerns expressed in this chapter parallel those expressed 
when state NAEP was first implemented. Although not all the dire predic- 
tions for state NAEP came true, there is considerable concern over the 
potential uses of district-level and market-basket results. Will district-level 
results be used to rank order districts within the state or across the country?2
Will districts be punished and rewarded for their performance? Will
district-level NAEP results become part of schools’ accountability systems? If
so, what impact will this have on NAEP? Will NAEP’s function as a monitor
of change be fundamentally altered by below-state reporting? What
effect will the release of market-basket sets of items have on state and local
instruction systems? Given the potential for varied effects, the same level of
effort on program evaluation would be called for as was implemented in
connection with the Trial State Assessment. In addition, support systems
will be needed to assist states and districts in appropriate uses and
interpretations of the new products and reports.

2 This presupposes that the sampling design and interest levels result in sufficient
numbers of participating districts to produce a “cross-district data compendium” like the
cross-state data compendia.

RECOMMENDATION 7-1: If the decision is made to proceed
with district-level reporting, NAEP’s sponsors should develop
and implement a plan for program evaluation, similar
to the research conducted during the initial years of the Trial 
State Assessment, that would investigate the quality and util- 
ity of district-level NAEP data. 

RECOMMENDATION 7-2: The potential is high for signifi- 
cant impact on curriculum and/or assessment at the local 
levels. If either district-level reporting or market-basket 
reporting, with or without a short form, is planned for imple- 
mentation, the program sponsors should develop and imple- 
ment intensive support systems to assist districts and states in 
appropriate uses and interpretations of any such NAEP results 
reported. 
APPENDIX

A 

Background and Current Uses of the 
Consumer Price Index 



The CPI is a measure of the average change over time in the prices paid 
by urban consumers in the United States for a fixed basket of goods in a 
fixed geographic area. The CPI was developed during World War I so that 
the federal government could establish cost-of-living adjustments for work- 
ers in shipbuilding centers. Rapid increases in prices had made such an 
index necessary for calculating these adjustments. 

Today, the CPI is the principal source of information concerning 
trends in consumer prices and inflation in the United States. It is widely 
used as an economic indicator and a means of adjusting other economic 
series (e.g., retail sales, hourly earnings) and dollar values used in govern- 
ment programs. The CPI is used to adjust payments to Social Security 
recipients and to federal and military retirees, and for a number of entitlement
programs such as food stamps and school lunches. Also, individual
income tax brackets and personal exemptions are adjusted for inflation 
using the CPI. The index’s impact on the finances of the federal govern- 
ment is significant. In fiscal year 1996, for example, the Office of Man- 
agement and Budget estimated that each one-percent increase in the CPI 
produced a $5.7 billion increase in oudays and a $2.5 billion decline in 
revenues. In addition, as the most widely used index for measuring infla- 
tion, the CPI aids in the formulation of fiscal and monetary policies and in 
economic decision-making. 

The CPI measures the rates of changes in prices, not their absolute 
levels. Most of the specific CPI indexes have a 1982-84 reference base. 
That is, the average price level for the 36-month period covering these years
is established as having an index level of 100. A 10-percent increase in
price since this reference period would then correspond to an index level of
110.
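
The reference-base arithmetic can be illustrated with a minimal sketch. The function name and the price levels below are hypothetical and are not BLS data; the sketch only shows how an index level follows from the ratio of a later price level to the reference-period average.

```python
def index_level(current_price_level, reference_price_level, base=100.0):
    """Express a price level relative to the reference period (base = 100)."""
    if reference_price_level <= 0:
        raise ValueError("reference price level must be positive")
    return base * current_price_level / reference_price_level


# Hypothetical average price levels, chosen only for illustration.
reference = 250.0  # average over the reference period
later = 275.0      # 10 percent above the reference level

level = index_level(later, reference)
percent_change_since_reference = level - 100.0
```

Because the index is a ratio, it conveys how far prices have moved since the reference period and nothing about the absolute level of prices.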

The Bureau of Labor Statistics currently produces two national indices
every month: the CPI for All Urban Consumers (CPI-U) and the more 
narrowly based CPI for Urban Wage Earners and Clerical Workers (CPI- 
W), which is developed using only data from households represented in 
certain occupations. In addition to monthly release of the national CPI 
estimates, the BLS publishes monthly indexes for the four principal regions 
of the nation (Northeast, Midwest, South, and West), as well as for collec- 
tive urban areas classified by population size. The BLS also publishes in- 
dexes for 26 local areas on monthly, bimonthly, or semiannual schedules. 
An individual area index measures how much prices have changed over a 
specific time interval in that particular area. However, because of the na- 
ture of the index and the specifics of the sampling design, indexes cannot 
be used for relative comparisons of the level of prices or the cost of living in 
different geographic areas. In fact, the compositions of the regional market 
baskets generally vary substantially across areas because of differences in 
purchasing patterns. 



COLLECTION OF DATA ON CONSUMER EXPENDITURES 

The BLS develops the CPI market basket on the basis of detailed infor- 
mation provided by families and individuals about their actual purchases. 
Information on purchases is gathered from households in the Consumer 
Expenditure (CE) Survey, which consists of two components: an interview 
survey and a diary survey. 1 Each component has its own questionnaire and 
sample. 

In the quarterly interview portion of the CE survey, an interviewer 
visits every consumer in the sample every 3 months over a 12-month pe- 
riod. The CE interview survey is designed to collect data on the types of 
expenditures that respondents can be expected to recall for a period of 3 
months or longer. These expenditures include major purchases, such as 



1 Much of the material in this section is excerpted from Appendix B of Consumer Expen-
diture Survey, 1996-97, Report 935, Bureau of Labor Statistics, September 1999. 
property, automobiles, and major appliances, and expenses that occur on a
regular basis, such as rent, insurance premiums, and utilities. Expenditures 
incurred on trips are also reported in this survey. The CE interview survey 
thus collects detailed data on 60 to 70 percent of total household expendi- 
tures. Global estimates — i.e., expense patterns for a 3-month period — are 
obtained for food and other selected items, accounting for an additional 20 
percent to 23 percent of total household expenditures. 

In the diary component of the CE survey, consumers are asked to main- 
tain a complete record of expenses for two consecutive one-week periods. 
The CE diary survey was designed to obtain detailed data on frequently
purchased small items, including food and beverages (both at home and in 
eating places), tobacco, housekeeping supplies, nonprescription drugs, and 
personal care products and services. Respondents are less likely to recall 
such items over long periods. Integrating data from the interview and 
diary surveys thus provides a complete accounting of expenditures and in- 
come. 

Both the interview and diary surveys collect data on household charac- 
teristics and income. Data on household characteristics are used to deter- 
mine the eligibility of the family for inclusion in the population covered by 
the Consumer Price Index, to classify families for purposes of analysis, and 
to adjust for nonresponse by families who do not complete the survey. 
Household demographic characteristics are also used to integrate data from 
the interview and diary components. 

Samples for both the interview and diary components of the Consumer
Expenditure Survey are national probability samples of households
designed to be representative of the total U.S. civilian population. Sampling
occurs in two stages. The first stage of sampling involves the selection
of primary sampling units (PSUs) that consist of counties, groups of counties,
and portions of counties. The PSUs are classified into four categories:
(1) large metropolitan statistical areas (MSAs); (2) medium-sized MSAs;
(3) nonmetropolitan areas that are included in the CPI; and
(4) nonmetropolitan areas where only the urban population is included in
the CPI. Lists of housing units in each PSU are constructed using decennial
census data and supplemental information on new housing construction.
The second stage of sampling involves the selection of housing units
from each PSU for participation in the CE survey.

The interview component is a panel rotation survey. Each panel, a set 
of selected addresses, is interviewed for five consecutive quarters and then 
dropped from the survey. As one panel leaves the survey, a new panel is 
introduced. Thus, approximately 20 percent of the addresses are new to
the survey each quarter. For the 1996 and 1997 CE interview surveys,
approximately 9,000 addresses were selected in each quarter. Allowing for 
nonresponses, the number of suitable interviews per quarter was targeted at 
approximately 5,400. Thus, more than 5,000 families participate in the 
interview survey in any given calendar year. 

The diary component involves drawing a new sample each year, inde- 
pendent both of previous years and of the sample for the interview compo- 
nent. Approximately 7,000 addresses were contacted for the 1996 and 
1997 CE diary surveys. Allowing for nonresponses, the number of house- 
holds providing usable diaries was targeted at approximately 5,400 per year. 

CONSTRUCTION OF THE CPI MARKET-BASKET SYSTEM 

The BLS prices the CPI market basket and produces the monthly CPI 
index using a complex, multistage sampling process. The first stage in- 
volves the selection of urban areas that will constitute the CPI geographic 
sample. Because the CPI market basket is constructed using data from the 
CE survey, the geographic areas selected for the CPI-U are also used in the 
CE survey. Once selected, the CPI geographic sample is fixed for 10 years 
until new census data become available. Using the information supplied by 
families in the CE surveys, the BLS constructs the CPI market basket by 
partitioning the set of all consumer goods and services into a hierarchy of 
increasingly detailed categories, referred to as the CPI item structure. 2 The 
levels of the CPI classification are: 

• All items 

• Major groups 

• Intermediate aggregates 

• Expenditure classes 

• Item strata (or categories) 

• Entry level items 

For example, in developing the current market basket the BLS has 
classified expenditures reported in the 1993-95 CE survey into more than 



2 Much of the material in this section and the next section is excerpted from CPI 
materials available at the Bureau of Labor Statistics web site, http://www.bls.gov. 







200 item strata arranged into eight major groups: food and beverages; hous- 
ing; apparel; transportation; medical care; recreation; education and com- 
munication; and other goods and services. For each geographic area (pri- 
mary sampling unit) included in the CPI geographic sample, the BLS 
assigns each item category an expenditure weight, or importance, based on 
its share of total family expenditures. Aggregating weights from the geo- 
graphic areas in the CPI sample derives item category weights at the na- 
tional level. Thus, one can ultimately view the CPI market basket as a set 
of item strata and associated expenditure weights. 
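
The idea of a market basket as a set of categories with expenditure weights can be sketched as follows. This is a simplified fixed-weight average under stated assumptions, not the BLS estimation procedure; the spending amounts, category indexes, and function names are hypothetical.

```python
def expenditure_weights(spending_by_category):
    """Convert spending amounts into shares of total expenditure."""
    total = sum(spending_by_category.values())
    if total <= 0:
        raise ValueError("total spending must be positive")
    return {cat: amount / total for cat, amount in spending_by_category.items()}


def weighted_index(weights, index_by_category):
    """Combine category indexes using fixed expenditure weights."""
    return sum(weights[cat] * index_by_category[cat] for cat in weights)


# Hypothetical spending for a handful of categories (not CE survey data).
spending = {
    "food and beverages": 150.0,
    "housing": 400.0,
    "transportation": 200.0,
    "other goods and services": 250.0,
}

# Hypothetical category index levels relative to a reference base of 100.
category_indexes = {
    "food and beverages": 110.0,
    "housing": 120.0,
    "transportation": 105.0,
    "other goods and services": 100.0,
}

weights = expenditure_weights(spending)
overall = weighted_index(weights, category_indexes)
```

In this sketch, a category that takes a larger share of the family budget moves the overall index more for the same price change, which is the role the expenditure weights play in the text above.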



MONTHLY DATA COLLECTION AND PRICING 

Following the sampling process, BLS analysts select the outlets (places 
where area residents make purchases), goods and services (specific items 
purchased), and residents’ housing units to be used in computing the
monthly CPI. Selection of the CPI outlet and item samples is based on
information from the Telephone Point-of-Purchase Survey (TPOPS), a
household survey that provides BLS with a sampling frame of outlets and
retail establishments visited by urban consumers. The TPOPS obtains data
from about 17,000 families annually on the types of goods and services
consumers purchase, the amount of these expenditures, and the places the
expenditures are made. Since the 1998 CPI revision, TPOPS data have 
been collected using computer-assisted telephone interviews (CATI), which 
allows a portion of all commodities and services to be updated, or rotated, 
in each sampling unit every year. 

Within item categories, BLS statisticians select hundreds of entry-level 
items and match them with the sampled retail outlets for price collection.
The number of price quotations and observations to be obtained is determined
statistically with the objective of producing the most accurate national
all-items index possible, subject to available funds. The BLS field
staff who collect CPI prices use the entry-level items as the starting point
for the selection of the unique products or services, within the outlet,
whose prices will be monitored. This selection is made using a random
probability sampling method that reflects an item’s relative share of sales at
that particular store.
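
Selection in proportion to sales share can be sketched as a weighted random draw. This is only an illustration of the general idea, assuming hypothetical product names and sales figures; it is not the BLS field procedure.

```python
import random


def select_product(sales_by_product, rng):
    """Pick one product with probability proportional to its sales."""
    products = list(sales_by_product)
    weights = [sales_by_product[p] for p in products]
    if sum(weights) <= 0:
        raise ValueError("total sales must be positive")
    return rng.choices(products, weights=weights, k=1)[0]


# Hypothetical sales of one entry-level item at a single outlet.
sales = {"brand A": 600.0, "brand B": 300.0, "brand C": 100.0}

rng = random.Random(0)  # seeded only so the sketch is repeatable
draws = [select_product(sales, rng) for _ in range(10000)]
share_a = draws.count("brand A") / len(draws)
```

Over many draws, a product with 60 percent of sales is chosen about 60 percent of the time, so the products actually priced tend to mirror what consumers buy at that store.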

Each month, BLS data collectors, called economic assistants, visit or 
call thousands of retail stores, service establishments, rental units, and doc- 
tors’ offices throughout the United States to obtain price information on 
the thousands of items used to track and measure price change in the CPI. 






These economic assistants record the prices of about 80,000 items each 
month. These 80,000 prices thus represent a scientifically selected sample 
of the prices paid by consumers for goods and services purchased. 

UPDATING AND IMPROVING THE CPI MARKET BASKET 

Because of the many important uses of the monthly CPI, there is great 
interest in ensuring that the CPI market basket accurately reflects changes
in consumption over time. Each decade, data from the U.S. census of 
population and housing are used to update the CPI process in three key 
respects: (1) redesigning the national geographic sample to reflect shifts in 
population; (2) revising the CPI item structure to represent current con- 
sumption patterns; and (3) modifying the expenditure weights to reflect 
changes in the item structure as well as reallocation of the family budget. 

In response to growing demands for a more current CPI market bas- 
ket, the BLS has redesigned some of the survey processes to enable more 
frequent revision than once every five or ten years. In particular, the new 
TPOPS sample design permits a shift to sample rotation by category rather 
than by geographic area, thereby facilitating accelerated sample rotation in 
product areas where the markets are most dynamic. The sample rotation 
involves (1) reselecting the retail stores and business establishments to be 
visited by BLS field representatives and (2) reselecting the unique products 
and services to be priced for the market basket. For example, to represent 
the market basket item category “records and tapes,” a cassette tape sold in 
Outlet A could be replaced by a compact disc sold in Outlet B. In addi- 
tion, the sample size of the ongoing CE survey has been increased substan- 
tially, which will enable the production of updated expenditure weights 
every two years starting in January 2002. 
APPENDIX

B 

Depicting Changes in Reading Scores — 
An Example of a Usability Evaluation 



To illustrate how the usability evaluation might work, we will focus on 
the redesign of a single data display from the report NAEP 1994 Reading: A 
First Look Report (Williams, Reese, Campbell, Mazzeo, & Phillips, 1995). 
This report is designed for a broad audience of policy makers, educators, 
and the press. Wainer and colleagues (1997a, 1999) redesigned several dis- 
plays from the report in accord with specific usability standards described 
in Visual Revelations (Wainer, 1997b). These revisions were evaluated 
through formal usability trials in which preference and comprehension 
measures were taken (Wainer et al., 1999). We discuss the design modifications
that resulted in one of Wainer’s more successful redesigns and then
illustrate how the processes shown in Box 6-1 (see Chapter 6) might be 
applied to make the illustration still more usable and accessible. 

The original display appears in Figure B-1 and shows test scores as a
function of administration date (1992 and 1994), grade (fourth, eighth, or 
twelfth), and geographic region (Central, Northeast, Southeast, and West). 
The format chosen is a perspective-view bar graph with region represented 
along the horizontal axis and grade represented in depth (z-axis). Scores for 
both years are shown, side by side, for each grade within each region. 
Numerical data values are placed above the tops of the individual bars. In 
his revision, Wainer selected a two-dimensional line-graph for these data, 
and he removed the raw numerical values from the display. Year of admin- 
istration was represented on the horizontal axis and all other conditions 
were labeled by line grouping (grade) or by individual line (region) directly 
by the relevant display objects. He also included a legend to help readers
identify individual lines. The revision appears in Figure B-2.

FIGURE B-1 Wainer, H., Hambleton, R.K., and Meara, K. (1999). Alternative displays
for communicating NAEP results: A redesign and validity study. Journal of Educational
Measurement, 36(4), 301-335. Copyright 1999 by the National Council on
Measurement in Education; reproduced with permission from the publisher.

WHAT WOULD WE LEARN FROM A USER NEEDS ANALYSIS? 

Before beginning to revise the display again, it is essential to have a list 
of user requirements based on the results of user-needs analysis. This would 
involve bringing together small “user panels” composed of people representing
the range of individuals who may be exposed to NAEP data reports.
Note that the emphasis here is on diversity rather than typicality of potential 
group members. Thus, parents with limited educational backgrounds 
should be included as well as educators who may have extensive back- 
grounds in educational testing. Policy makers with very different political 
agendas should be chosen, as well as members of the local and national 
press. 

FIGURE B-2 Wainer, H., Hambleton, R.K., and Meara, K. (1999). Alternative displays
for communicating NAEP results: A redesign and validity study. Journal of Educational
Measurement, 36(4), 301-335. Copyright 1999 by the National Council on
Measurement in Education; reproduced with permission from the publisher.

Once user panels are established, then focus groups, semi-structured
brainstorming sessions, individual interviews, and other related methods
can be held to determine the expectations of group members. One of the
most important questions in redesigning an existing display is what the
users would like to know. What kinds of conclusions would they like to be
able to draw? By giving panelists the data sets in a number of formats 
(numerical data tables and existing graphs in the present case), it would be 
possible to see which interpretations are spontaneously made, as well as the 
order in which these conclusions are drawn. Since the data presentation 
format will influence the nature of these spontaneous interpretations 
(Carswell and Ramzy, 1997), it is important to consider the conclusions 
drawn from the various formats. Alternatively, the data parameters could 
be verbally described to them and panelists allowed the chance to ask ques- 
tions. For instance, they could be told: 
We have average NAEP reading test scores from 1992 and 1994. These are
reported separately for the 4th, 8th, and 12th grades. Data are also broken
down by region — western, central, southeastern, or northeastern schools. 
What would you like to know about these data? 



Tracking panelists’ questions is an effective method for eliciting the informational
needs of potential users.

To illustrate, suppose that these methods revealed that the following 
questions were asked of the 1992-1994 change data in the following order: 

(1) Were we (the United States as a whole) doing better or worse in 
1994? 

(2) Which regions were showing the most change and in which 
direction? 

(3) What kind of change occurred in my region? 

(4) How does the change that occurred in my region compare to that 
found in other regions? 

These questions should drive decisions about the content and structure of 
data displays. In addition, when performing usability tests on the compre- 
hensibility of the data display, users' abilities to answer these questions ac- 
curately should be a core criterion of design success. With the information 
needs of the users better understood, one or more usability analysts can 
perform a heuristic evaluation. 

HEURISTIC EVALUATION 
OF THE ORIGINAL AND REVISED DISPLAYS 

In the text that follows, we evaluate the original and revised displays 
(Figures B-1 and B-2) of the 1992 and 1994 NAEP reading data by applying
the heuristics proposed for the review of NAEP reports (Box 6-1). In
addition, we propose changes to be made in the next design iteration. 



Is the format compatible with the performance criterion selected? 

Suppose that the questions raised during a hypothetical user-needs 
analysis revealed that users were primarily interested in ordinal information 
(e.g., “Did scores increase or decrease from 1992 to 1994?” “Did region X’s
scores increase/decrease more than region Y’s?”). It is likely that the readers
would want quick access to this information. Thus, a graphical display,
rather than a table, is the appropriate choice. This also suggests that dis- 
playing the exact data values in conjunction with the graph, as in the origi- 
nal bar chart, may be unnecessary and may even impede rapid access of the 
comparative information. Our revised display, like the two previous ver- 
sions, will be graphical. And, as with the previously revised display, we will 
not report numeric values. 



Is the structure of the display compatible with the structure of 
the data? 

This heuristic is probably not relevant in the present case. Besides test 
scores, two (theoretically) continuous variables are displayed in the present 
data set: grade level and year of test administration. However, the present
data describe only three grade levels and two test years. Thus, we can say 
very little about the relationship between either of the latter two variables 
and test scores. 



Is the perceptual grouping of information compatible with the 
mental grouping users must perform to extract the information they 
want and need? 

The findings from our hypothetical user-needs analysis suggest that 
users clearly want to make comparisons and that they are most interested in 
comparing scores across test administration years. Thus, the two years for 
each of the region-grade combinations must be tightly grouped so that they 
can be perceived together. In the original graphic (Figure B-1), the two
years were presented side by side, allowing grouping by proximity. In the 
revised graph (Figure B-2), the two data points were not close together 
relative to other data points, such as those showing test means for other 
regions; however, the two administrations for each region-grade condition 
were connected by a line. In the next revision of the graph, the 1992 and 
1994 values should be connected by line segments, but they should also be 
closer than in the first revision. 

A second issue is the relative tightness of the grouping of data pairs for 
1992 and 1994 values across the same region versus across the same grade 
level. That is, should all of the data for a region be grouped together or 
should all of the data for a single grade be grouped together? In the original 
graph (Figure B-1), the data for a given grade appeared in the same horizontal
row perpendicular to the line of sight, while the data for a given region
fell along a row parallel to the line of sight. Thus, grouping by region and 
grouping by grade are about equally strong. The first revision (Figure B-2) 
made grouping by grade stronger through spatial proximity, which allows 
easier access to comparisons among different regions within a grade level. 
Because our hypothetical user-needs analysis suggested that comparisons
among regions were of greater importance, we would propose continuing 
to group by grade level so that data from different regions appear side by 
side. We would further highlight regional comparisons by adding a re- 
gional boundary around (or “frame”) the data from each grade level. 



Is the level of numeric detail compatible with the reliability of the 
data and the needs of the reader? 

Based on our hypothetical findings, we would drop the numeric means 
from the graph, as in the first revision (Figure B-2). Given the users’ interest 
in the mean score changes from 1992 to 1994, reliability becomes impor- 
tant; that is, are the differences between the two mean scores reliable? 
Perhaps pairs of scores (i.e., pairs of bars in the original graph and line 
segments in the revised graph) could be coded as exceeding or not exceed- 
ing a specific reliability criterion. For example, in the original figure, pairs 
that were significantly different were coded with asterisks on one of the two 
bars. 



Is data salience compatible with data importance? 

As described above, statistically reliable changes in scores across test 
administrations should be differentiated from those that are not reliable. 
The asterisk used in the original figure (Figure B-1) is not highly
attention-getting. Color could be used for this purpose and, possibly, a more
saturated color could highlight the reliable differences.

In terms of the relative salience of other graphic elements, the revised 
graph clearly highlights changes in scores from 1992 to 1994 that are dif- 
ferent in magnitude or direction across the geographic regions. However, 
this salience may actually be misleading in making certain perceptual com- 
parisons across grade levels. On the other hand, the original graph does not 
clearly highlight unusual changes in scores. Its placement of individual 
data points on the page tends to call attention to fourth-grade scores be- 
cause they appear closer to the reader than the other scores in this “3-d” 
graph. This organization would be warranted if based on the perception
that the audience is most interested in the fourth-grade scores. Otherwise, 
this organization could be a misuse of salience cues. In the revised graph 
(Figure B-2), lengths of the lines connecting scores from the same grade- 
region will draw attention to the largest changes from 1992 to 1994. 



Is the data display compatible with working memory limits? 

One crude way of evaluating if a data display is compatible with work- 
ing memory limits is to simply count the number of groups of elements, as 
well as the number of elements in each of these groups. For example, the 
original graph (Figure B-1) could be described as 12 pairs of bars, or 12
groups of two elements. The revised graph (Figure B-2) could be described 
as three groups of five lines. A closer look should be taken whenever the 
number of major groups, or number of elements within those groups, is 
greater than four. Thus, the “12 pairs” and “five lines” of the original and 
revised graphs, respectively, could pose some difficulties for working 
memory, depending on the tasks to be performed. If a reader is simply 
trying to count the number of times test scores appeared to decrease across 
the years, then exceeding the “rule of fours” is probably not a big problem. 
However, it might be different if an individual were trying to capture all 
instances of decreasing scores to generate causal hypotheses. One suggestion
for the redesign of the original graph (Figure B-1) would be to create
more distinctive groups for different grade levels. This would lead to three 
groups of four pairs of bars, which may help readers “chunk” information 
in working memory in a more manageable way. 
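
The crude counting check described above can be written down as a minimal sketch. The function name and the list encoding of each display (one entry per group, giving the number of elements in that group) are assumptions made for illustration; the limit of four comes from the text.

```python
def needs_closer_look(group_sizes, limit=4):
    """Flag a display whose group count or any group size exceeds the limit."""
    too_many_groups = len(group_sizes) > limit
    too_many_elements = any(size > limit for size in group_sizes)
    return too_many_groups or too_many_elements


# Each list holds one entry per perceptual group: the elements in that group.
original_graph = [2] * 12    # 12 pairs of bars
revised_graph = [5] * 3      # three groups of five lines
regrouped_original = [4] * 3 # three groups of four pairs of bars

flags = {
    "original": needs_closer_look(original_graph),
    "revised": needs_closer_look(revised_graph),
    "regrouped original": needs_closer_look(regrouped_original),
}
```

A flag from this check is only a prompt to examine the display against the tasks readers will perform, since the text notes that exceeding the limit matters for some tasks and not for others.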

In the initial revision of the graph of reading scores (Figure B-2), 
two problems are evident. First, as noted, there are five lines in each of the 
three grade-level groupings. In addition to scores from the four regions, a 
fifth line represents mean scores across the entire United States. This would 
seem to be important data to represent directly, given our hypothetical 
users’ need to know how students in the United States are performing across 
the two years. However, it may not be necessary to know the mean value of 
test scores during both years to answer this question. Simply determining 
the overall pattern of the graphic — whether the lines seem to be mostly 
“going up” or “going down” — may suffice. Therefore, we would suggest 
removing the line showing the national means. 

A second problem relates to the use of legends to identify regions on 
the revised graph (a number-of-lines-per-group problem). Different point







symbols are used for each of the four regions, and the overall United States 
data are represented by a different line-style and point-symbol combination.
Memorizing five symbols can be difficult, a problem that can often
be remedied by placing labels directly by the lines in a graph (Milroy & 
Poulton, 1978). An attempt was made to do this in the revised graph; 
nevertheless, because the lines overlap, the user must still rely on the sym- 
bols described in the legend. Again, dropping one of the lines would help 
the overlap problem that prevents use of the labels to the side of the lines. 
In addition, it would reduce the load on working memory by ensuring that 
readers are more likely to correctly identify the different lines, even when it 
is necessary to refer to the legend. 



Are physical properties of the stimuli compatible with our ability to 
detect, discriminate, and recognize these properties? 

Both the original graph and the revised graph use differences in posi- 
tion along an aligned scale to represent differences in performance between 
1992 and 1994 for each region-grade combination. According to work by 
Cleveland and McGill (1984, 1985), this is one of the most accurate per- 
ceptual comparisons that can be made. Comparisons across different 
regions and grades within a given year are also made by comparing points 
along a common scale in the revised figure (Figure B-2). In the original 
figure, comparisons across grades are based on differences in position of bar 
heights along nonaligned common scales. People are less accurate at these 
judgments. In the revised figure, comparisons of changes across region- 
grade conditions are to be made by comparing line slopes. Generally, people 
do not make accurate estimates of relative slopes. For a new revision of the 
graph, we would recommend devising a format that uses line lengths, which 
are more likely to be correctly interpreted. 

We should also be aware of the potential visual distortions or illusions 
that can occur in both the original and revised graphs. In the original 
graph, the use of linear perspective and other depth cues (e.g., occlusion) 
can lead to size illusions, with the size of the bars in the front of the graph 
underestimated relative to the ones in the back. With line graphs, designers 
should be aware that we often judge slope relative to nearby frameworks 
such as other lines. The revised graph (Figure B-2) demonstrates this type 
of illusion. For example, the central region appears to have a very large 
increase in fourth graders’ performance across the two-year testing interval. 
This change is actually only one-fourth the size of the decrease in scores 
among twelfth graders for the same region. However, the line graph seems
to show that the increase among fourth graders is at least as big as the 
decrease among twelfth graders. The reason for this misperception is that 
the slope of a line tends to be over- or underestimated depending on the
slopes of surrounding lines (and particularly lines that intersect the target
line). Specifically, for the fourth-grade data, the positively sloping line for 
the central region intersects with a negatively sloping line for the northeast 
region. This presentation tends to accentuate the slope of each. This is 
known as the Poggendorff illusion (Hubel & Wiesel, 1965, 1979). We will
attempt to avoid the use of both perspective and line slope in our revision 
of the NAEP reading scores graph. 



Is the organization of information in the display compatible with 
spatial metaphors and population stereotypes? 

When the purpose is to show regional differences, the display should 
consider cartographic conventions of representing North at the top of a 
map and West to the far left. A display that must order information about 
geographic regions across a page should conform to the left-for-West rule. 
In our case, this means that the following left-to-right arrangement of 
regions should be used: West, Central, Southeast, and Northeast. Neither
the original nor the revised graph uses this ordering. In the original graph
(Figure B-1), the map convention is reversed, with the most eastern region on
the left of the page. In the revised graph (Figure B-2), the regions are 
ordered according to their mean scores. 



Is the choice of display format and ornamentation compatible with 
the users’ preferences and biases? 

There is evidence that people are more likely to distrust data presented 
in perspective (3-D) displays (Carswell, Frankenberger, and Bernhard,
1991), such as the original graph. Further, evidence suggests that people 
less familiar with graphs tend to feel less threatened by bar graphs than by 
line graphs (Vernon, 1952). In our revision of the graph, we will avoid the 
use of perspective and the use of traditional line graphs as well.

THE REVISED GRAPH 

Based on the changes suggested by the heuristic evaluation described 
above, we produced the graph shown in Figure B-3. Note that position on 
common aligned scales is maintained for comparisons of scores across ad- 
ministrations and across regions within a grade level. However, absolute- 
score comparisons across grade levels cannot be made with this format. 
Since the hypothetical user-needs analysis indicated that few users would 
try to make such comparisons, we felt justified in sacrificing this piece of 
information. In return, the revised graph enables the use of length judg- 
ments for comparing the magnitude of changes among different regions 
and grades. 

The data are grouped into three clearly demarcated panels by grade 
level. Within each grade level there are four lines, each representing the 
two mean scores for a region. Rather than connecting two points that are 
offset horizontally, the revised graph uses two points along the same vertical 
grid line to represent the two test administration dates. The end of the line 
representing the second administration is indicated by an arrowhead. For 
each grade level, the four regions are arranged from left to right using the 
West-to-East map convention. 
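
The layout rules described above can be sketched in code. The following is a minimal illustration, not the procedure used to produce Figure B-3: the names `REGION_ORDER`, `Glyph`, `build_panel`, and `change_length` are hypothetical, and any scores passed in would be the caller's own data. It shows one grade-level panel in which each region occupies a single vertical grid line, regions run West to East, and an unchanged score is marked with a diamond rather than an arrow.

```python
from dataclasses import dataclass

# Left-to-right order following the left-for-West map convention.
# "Midwest" is used in place of "Central", as in the revised graph.
REGION_ORDER = ["West", "Midwest", "Southeast", "Northeast"]


@dataclass
class Glyph:
    region: str
    x: int        # index of the vertical grid line within the panel
    start: float  # mean score at the first administration
    end: float    # mean score at the second administration (arrowhead end)
    marker: str   # "arrow_up", "arrow_down", or "diamond"


def build_panel(scores):
    """Build the glyphs for one grade-level panel.

    `scores` maps region name -> (first_mean, second_mean).
    Glyphs are returned in West-to-East order regardless of input order.
    """
    glyphs = []
    for x, region in enumerate(REGION_ORDER):
        first, second = scores[region]
        if second > first:
            marker = "arrow_up"
        elif second < first:
            marker = "arrow_down"
        else:
            marker = "diamond"  # no change between administrations
        glyphs.append(Glyph(region, x, first, second, marker))
    return glyphs


def change_length(glyph):
    """Length of the arrow, i.e., the size of the change in mean score."""
    return abs(glyph.end - glyph.start)
```

Because both points for a region share one grid line, the size of a change is read as a length rather than as a slope, which is the judgment the revised design is meant to support.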

In addition, several other changes were made to simplify the presentation. 
The term “Midwest” was substituted for the term “Central” in order to 
streamline the axis labels. The grade-level panels were offset from left to 
right to mimic the spatial metaphor of moving through the grades as if 
climbing a staircase. Footnotes and legends were deleted. Instead, a few 
explanatory comments were presented as part of the graph's title, where they 
are more likely to be read. 



A USABILITY TEST: 
IS THE NEW GRAPH BETTER THAN THE EARLIER VERSIONS? 

Even though we have a redesigned graph that incorporates findings 
from the user-needs analysis and the heuristic evaluation, we still would 
not know if the new design were actually better or preferred by users. 
Accordingly, the next step must be usability testing similar to that described 
by Wainer and colleagues (1999). The multiple versions of the graph should 
be viewed by different groups of subjects representative of the intended 
audiences. Users should be asked what they learned from the graph, and 
researchers should note whether or not users drew conclusions relevant to 
the major questions defined in the user-needs analysis. These interpretations 
can be timed, and follow-up questions can be asked to determine if 
users can access important information. Preference data should also be 
collected after allowing participating users to view all three versions of the 
graphs. There are many variations of usability tests, and many additional 
methods are described in Rubin (1994) and Nielsen (1993). 

FIGURE B-3 Changes in Regional NAEP Reading Scores from 1992 to 1994. The 
direction and length of arrows indicate the direction and size of the change in average 
scores. A diamond indicates that the average score remained the same. 

If the graph were to be included in the next release of NAEP reports, 
then data on citations, requests for publication, and misinterpretations by 
the press can also be collected to gauge display comprehensibility and 
accessibility. These data should guide future revisions. 
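
The measures suggested for the usability test (interpretation time, whether users' conclusions address the major questions, and stated preference) could be tabulated along the following lines. This is a sketch under stated assumptions only: the function name `summarize`, the record fields `version`, `seconds`, and `relevant`, and the choice of the median as the summary of timing are all hypothetical and are not taken from any of the studies cited.

```python
from collections import Counter
from statistics import median


def summarize(trials, preferences):
    """Summarize a usability test of several graph versions.

    `trials` is a list of dicts with hypothetical keys:
        "version"  - label of the graph version viewed
        "seconds"  - time taken to give an interpretation
        "relevant" - True if the conclusion addressed a major question
    `preferences` is a list of version labels, one per participant,
    collected after participants have seen every version.
    """
    by_version = {}
    for trial in trials:
        by_version.setdefault(trial["version"], []).append(trial)

    summary = {}
    for version, rows in by_version.items():
        summary[version] = {
            "median_seconds": median(r["seconds"] for r in rows),
            "share_relevant": sum(r["relevant"] for r in rows) / len(rows),
        }
    # Counter tallies how many participants preferred each version.
    return summary, Counter(preferences)
```

Keeping the timing, relevance, and preference results separate, as here, would let a reviewer see whether the version users prefer is also the one they interpret most quickly and accurately.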